Introduction to Data Science (MATH 4100/COMP 5360), University of Utah.
Team collaboration was done through GitHub: https://github.com/ArithmeticR/COMP_5360_Project
Brian has a background in User Experience Design. He is interested in using data to influence interaction design.
Li has a background in hydroinformatics, especially big data in water resource management. He is interested in data mining and data visualization.
Trevor has a background in marketing research. He is interested in developing models for pricing products and identifying attributes to use in marketing.
The primary objective is to use natural language processing on wine reviews to classify each review's category, country, and taster with machine learning. The secondary objective is to build an artificial neural network that classifies all three at once, and then examine the output of the network run backwards, with the aim of discovering insights into the different combinations of category, country, and taster.
Brian - Learn more about how to apply data analysis to solve marketing issues. Learn more about NLP and processing emotion in writing.
Jiada - Learn more about how to visualize big data to uncover the valuable information hidden in it. Benefits: with data mining we can help people change their consumption behavior and save water.
Trevor - Learn more about NLP, and about ensemble methods.
The data for the project was collected from https://www.winemag.com. The site contains 215,395 reviews of wines that include the price. The https://www.winemag.com/robots.txt file specifies a crawl-delay of 5 seconds, so scraping all the reviews from a single computer takes around 300 hours (215,395 requests × 5 s ≈ 1,077,000 s ≈ 12.5 days).
Url - Where the review is hosted.
Title - Name of the review.
Points - Number of points given in the review, on a 100-point scale.
Description - The taster's review of the wine.
Price - Price of the wine. We think it's all in US dollars.
Variety - Type of wine.
Appellation - Name of the region where the wine's grapes were grown.
Winery - Name of the winery that produced the wine.
Alcohol - Percent of alcohol.
Bottle Size - Size of the wine bottle. It might all be in milliliters.
Category - Category of the wine.
Importer - Name of the importer.
Date Published - Date the review was published.
User Avg Rating - Average rating given to the wine by site users.
Taster - Name of the reviewer.
############################################
###
### Don't run: for presentation only
###
############################################
import pandas as pd
import scipy as sc
import numpy as np
from bs4 import BeautifulSoup
import requests
import urllib.request
import pickle
import glob
import re
import time
import statsmodels.formula.api as sm
Extracting the data was very similar to the web scraping homework for the GitHub repositories. The first step was scraping the URLs of the individual review pages from the site, filtering on price. This ensured that only reviews with a price were pulled.
One thing to point out about the code (see the (*) comments below) is the way the crawl delay was implemented. The process slept for 5 seconds minus the number of seconds elapsed since the individual URL request began.
############################################
###
### Don't run: for presentation only
###
############################################
session = requests.Session()
HEADERS = {
    'user-agent': ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36')
}
first_page = 1
last_page = 7180
results_url = "https://www.winemag.com/?s=&drink_type=wine&price=1.0-15.99,16.0-25.99,100.0-199.99,76.0-99.99,61.0-75.99,41.0-60.99,26.0-40.99,200.0-*&page="
raw_pages = []
for i in range(first_page, last_page + 1):
    time_from_request = time.time()  # (*)
    url = results_url + str(i)
    print(i)
    response = session.get(url, headers=HEADERS)
    my_page = BeautifulSoup(response.content, 'html.parser')
    raw_review_urls = [review.get("href") for review in my_page.select(".review-item a")]
    clean_review_urls = [my_url for my_url in raw_review_urls
                         if bool(re.search(r'^https://www.winemag.com/buying-guide/', my_url))]
    pickle.dump(clean_review_urls, open("urls/raw_pages" + str(i) + ".p", "wb"))
    if time.time() - time_from_request < 5:  # (*)
        time.sleep(5.01 - (time.time() - time_from_request))
Not shown in the code above is the step where all the pickled files are concatenated to form a master list of the individual review URLs; a minimal sketch follows.
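A minimal sketch of that consolidation step, assuming the 'urls/raw_pagesN.p' naming used above:
import glob
import pickle

master_urls = []
for path in glob.glob("urls/raw_pages*.p"):
    with open(path, "rb") as f:
        master_urls.extend(pickle.load(f))
master_urls = sorted(set(master_urls))  # de-duplicate across result pages
print(len(master_urls), "review urls collected")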
Each review is written out to disk as soon as it is scraped. The main benefits of this approach, discussed below, are that the process can easily be restarted after a failure and that plain text files stay far smaller than the pickled batches we first tried.
This is the function used to write each review out to disk. Notice that it wasn't exactly clear in advance which fields would be contained in the primary and secondary blocks, so the function concatenates each field name and value with "||||". The assumption is that "||||" is unique enough not to appear in any field name.
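For illustration, a single line in one of these review files might look like this (a hypothetical record, with tab characters shown as <tab>):
https://www.winemag.com/buying-guide/example-wine<tab>Example Winery 2012 Pinot Noir<tab>88<tab>A hypothetical review text...<tab>Some Taster<tab>Variety||||Pinot Noir<tab>Appellation||||Burgundy, France<tab>Category||||Red<tab>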
############################################
###
### Don't run: for presentation only
###
############################################
def write_my_file(i, raw_review_pages):
    # write one scraped review to disk as a single tab-separated line
    file = open("reviews/url_" + str(i) + ".txt", "w")
    x = raw_review_pages
    url_i = x[0]
    title = x[1]
    points = x[2]
    description = x[3]
    taster = x[8]
    primary_info_label = x[4]
    primary_info = x[5]
    secondary_info_label = x[6]
    secondary_info = x[7]
    # fixed fields first; strip newlines and tabs so the line stays parseable
    file.write(str(url_i).replace('\n', '').replace('\t', ''))
    file.write("\t")
    file.write(str(title).replace('\n', '').replace('\t', ''))
    file.write("\t")
    file.write(str(points).replace('\n', '').replace('\t', ''))
    file.write("\t")
    file.write(str(description).replace('\n', '').replace('\t', ''))
    file.write("\t")
    file.write(str(taster).replace('\n', '').replace('\t', ''))
    file.write("\t")
    # primary block: write each field as label||||value
    for y, z in zip(primary_info_label, primary_info):
        file.write(str(y).replace('\n', '').replace('\t', '').replace('<span>', '').replace('</span>', ''))
        file.write("||||")
        z = str(z).replace('\n', '').replace('\t', '')
        z = re.sub(r"<.+?>", "", z)
        file.write(z)
        file.write("\t")
    # secondary block: same format, but skip the user rating field
    for y, z in zip(secondary_info_label, secondary_info):
        y = str(y).replace('\n', '').replace('\t', '').replace('<span>', '').replace('</span>', '')
        if y != "User Avg Rating":
            file.write(y)
            file.write("||||")
            z = str(z).replace('\n', '').replace('\t', '')
            z = re.sub(r"<.+?>", "", z)
            file.write(z)
            file.write("\t")
    file.write("\n")
    file.close()
This code uses the file 'not_picked.csv', which contains all the urls that had not been scraped by the beginning of the process. The code to create this file is not shown, but it is done by performing a set difference on the master url list with the urls that have already been pulled (a sketch is given below). This was done twice to ensure all the urls were indeed pulled; some of the page requests failed the first time they were tried, and others never worked.
The code checks at each iteration whether the file with the 'page_i_in_loop' index already exists. The cost in inefficiency was offset by the ease of restarting the process upon failure.
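A minimal sketch of the set-difference step that produces 'not_picked.csv'; the pickled master list file name is an assumption, and the column is named 'x' to match the df.x access in the loop below:
import glob
import pickle
import pandas as pd

master_urls = set(pickle.load(open("urls/master_urls.p", "rb")))  # hypothetical file name
pulled_urls = set()
for path in glob.glob("reviews/url_*.txt"):
    with open(path, encoding="ISO-8859-1") as f:
        pulled_urls.add(f.readline().split("\t")[0])  # the url is the first tab-separated field
pd.DataFrame({"x": sorted(master_urls - pulled_urls)}).to_csv("not_picked.csv", index=False)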
############################################
###
### Don't run: for presentation only
###
############################################
df = pd.read_csv('not_picked.csv')
all_urls = df.x.values.tolist()
session = requests.Session()
HEADERS = {
    'user-agent': ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36')
}
first_page = 1
last_page = 215395
for page_i_in_loop in range(first_page, last_page + 1):
    time_from_request = time.time()
    my_files = glob.glob("reviews/*.txt")
    ## check both separators because of the path differences between windows and linux machines
    if "reviews/url_" + str(page_i_in_loop) + ".txt" not in my_files and \
       "reviews\\url_" + str(page_i_in_loop) + ".txt" not in my_files:
        url_i = all_urls[page_i_in_loop]
        try:
            response = session.get(url_i, headers=HEADERS)
            soup_review_page = BeautifulSoup(response.content, 'html.parser')
            structure_reviews = []
            try:
                title = soup_review_page.select(".heading-area .article-title")[0].text
            except:
                title = None
            try:
                points = soup_review_page.select(".rating #points")[0].text
            except:
                points = None
            try:
                description = soup_review_page.select(".description")[0].text
            except:
                description = None
            try:
                primary_info_label = soup_review_page.select(".primary-info .row .info-label span")
            except:
                primary_info_label = None
            try:
                primary_info = soup_review_page.select(".primary-info .row .info")
            except:
                primary_info = None
            try:
                secondary_info = soup_review_page.select(".secondary-info .row .info")
            except:
                secondary_info = None
            try:
                secondary_info_label = soup_review_page.select(".secondary-info .row .info-label span")
            except:
                secondary_info_label = None
            try:
                taster = soup_review_page.select(".taster .name")[0].text
            except:
                taster = None
            print(page_i_in_loop)
            structure_reviews = [url_i, title, points, description, primary_info_label,
                                 primary_info, secondary_info_label,
                                 secondary_info, taster]
            write_my_file(page_i_in_loop, structure_reviews)
        except Exception as e:
            print(str(e))
    if time.time() - time_from_request < 5:
        time.sleep(5.01 - (time.time() - time_from_request))
Truth be told, collecting the data was more of a pain than first thought. We first tried pickling the reviews in batches, but the file sizes were enormous; we are still not sure exactly why, though it may have been the string encoding or the nested structure we were using. Some time was wasted before switching to plain text files for storage, which greatly reduced the size. Scraping was also a time-consuming process: pulling all the data would have taken about 13 days on a single computer.
So, what ended up happening was running 10 Linux servers (the maximum number allowed by default at digitalocean.com) and collecting the reviews once the process was refined. The code above is very similar to the code run on these machines. The GitHub repo contains 24 branches for this reason.
The right encoding was difficult to find in Python (encoding = "ISO-8859-1"). For this reason, R was initially used to consolidate the reviews; for whatever reason, R was able to read the files in without any encoding issues. Some care had to be taken to handle the fields from the primary and secondary blocks.
######################################
###
#### R Code
###
#######################################
############################################
###
### Don't run: for presentation only
###
############################################
all_files <- list.files(path = "reviews", pattern = ".txt",
                        all.files = FALSE,
                        full.names = T, recursive = FALSE,
                        ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)
my_input <- matrix(NA, nrow = 300000, ncol = 20)
myinput_list <- vector("list", length = 300000)
i = 1
for (my_file in all_files) {
  con = file(my_file, "r")
  while (TRUE) {
    line = readLines(con, n = 1)
    if (length(line) == 0) {
      break
    }
    myinput_list[i] <- strsplit(line, "\t")
    i = i + 1
  }
  close(con)
}
myinput_list2 <- purrr::compact(myinput_list)
myinput_list3 <- myinput_list2
my_matrix <- matrix(NA, nrow=length(myinput_list3), ncol=15)
for (my_name in seq_along(myinput_list3)) {
  my_length <- length(myinput_list3[[my_name]])
  if (my_length > 12) {
    my_matrix[my_name, 1:my_length] <- myinput_list3[[my_name]]
  }
}
my_df <- as.data.frame(my_matrix,stringsAsFactors=FALSE)
my_df2 <- my_df[!duplicated(my_df$V1),]
colnames(my_df2) <- c('url','title','points','description','taster',paste0('V',6:15))
table(my_df2$taster)
my_df2[1,]
save(my_df2,file="unstructured_df.rdata")
Some basic cleaning and transformations were done in R, since the encoding still had not been figured out for Python.
The code for handling the primary and secondary blocks (the code between the (*) comments) is very inefficient, but the cost was offset by the ease of coding.
The files were chunked into blocks of 20,000 reviews. This seemed to be the right number to keep each chunk under the 25 MB limit for pushing files to GitHub: 214,103 unique reviews yield 10 full files plus an 11th with the remaining 14,103.
######################################
###
#### R Code
###
#######################################
############################################
###
### Don't run: for presentation only
###
############################################
load(file="unstructured_df.rdata")
# (*) handling the primary and secondary blocks
for (i in 1:nrow(my_df2)) {
  for (j in 6:15) {
    my_split <- strsplit(my_df2[i, j], "||||", fixed = T)[[1]]
    if (!make.names(my_split[1]) %in% colnames(my_df2)) {
      my_df2[[make.names(my_split[1])]] <- NA
    }
    my_df2[i, make.names(my_split[1])] <- my_split[2]
  }
}
# (*)
my_df3 <- my_df2[, -c(6:15)]
my_df3 <- my_df3[, -c(15)]
my_df3$Price2 <- as.numeric(gsub("$", "", gsub(", Buy Now", "", my_df3$Price), fixed = T))
my_df3$Alcohol2 <- as.numeric(gsub("%", "", my_df3$Alcohol)) / 100
my_df3$Bottle.Size2 <- gsub("ml", "", gsub(" ", "", tolower(my_df3$Bottle.Size), fixed = T))
my_df3$milliliters <- ifelse(grepl("l", my_df3$Bottle.Size2),
                             as.numeric(gsub("l", "", my_df3$Bottle.Size2)) * 1000,
                             as.numeric(my_df3$Bottle.Size2))
my_df3$price_per_liter <- 1000 * my_df3$Price2 / my_df3$milliliters
Appellations <- strsplit(my_df3$Appellation, ",")
my_df3$l1 <- NA
my_df3$l2 <- NA
my_df3$l3 <- NA
my_df3$l4 <- NA
my_df3$l5 <- NA
for (i in 1:nrow(my_df3)) {
  my_df3[i, paste0('l', (5 - length(Appellations[[i]]) + 1):5)] <- trimws(Appellations[[i]])
}
names_vector <- c("url", "Date.Published", "title", "taster",
                  "Alcohol", "Alcohol2",
                  "Bottle.Size", "Bottle.Size2", "milliliters",
                  "points", "Price", "Price2", "price_per_liter",
                  "Importer", "Winery",
                  "Appellation", "l1", "l2", "l3", "l4", "l5",
                  "Designation", "Category", "Variety", "description")
my_df3 <- my_df3[, names_vector]
for (i in 1:10) {
  write.table(my_df3[((i - 1) * 20000 + 1):(i * 20000), ],
              file = paste0("structured_df_", i, ".txt"), sep = "\t", row.names = F)
}
write.table(my_df3[200001:214103, ], file = paste0("structured_df_", 11, ".txt"), sep = "\t", row.names = F)
save(my_df3, file = "structured_df.rdata")
write.table(my_df3, file = "structured_df.txt", sep = "\t", row.names = F)
%%javascript
require.config({
    paths: {
        highcharts: "http://code.highcharts.com/highcharts",
        highcharts_exports: "http://code.highcharts.com/modules/exporting",
    },
    shim: {
        highcharts: {
            exports: "Highcharts",
            deps: ["jquery"]
        },
        highcharts_exports: {
            exports: "Highcharts",
            deps: ["highcharts"]
        }
    }
});
from PIL import Image
from wordcloud import WordCloud
import math
import numpy as np
import scipy as sc
import pandas as pd
import pylab
import scipy.stats as stats
import statsmodels.api as sm
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn import linear_model
from sklearn.utils import shuffle
from sklearn.neighbors import NearestNeighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (15, 9)
plt.style.use('ggplot')
from sklearn.externals import joblib
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import scale
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import TruncatedSVD
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
The right encoding turned out to be "ISO-8859-1". The files had been chunked into sizes below 25 MB, which allowed them to be stored on GitHub.
df = pd.read_csv('structured_df_1.txt', sep='\t', encoding="ISO-8859-1")
for i in range(2, 12):  # file 1 is already loaded, so start at 2
    df = df.append(pd.read_csv('structured_df_' + str(i) + '.txt', sep='\t', encoding="ISO-8859-1"))
df = df.drop_duplicates()
df.shape
df2 = df[df.description.notnull()]
df2 = df2[df2.points.notnull()]
df2 = df2[df2.price_per_liter.notnull()]
df2.shape
df2['price_per_liter_clip'] = df2['price_per_liter'].clip(0, 100)
df2.dtypes
df2.describe()
g = sns.PairGrid(df2[['milliliters','points','price_per_liter','Category']], hue="Category")
g = g.map_diag(plt.hist)
g = g.map_offdiag(plt.scatter)
g = g.add_legend()
There are a lot of missing values for alcohol, and some values appear to be coded on the wrong scale.
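A quick check of both issues; the 5%-25% plausible-alcohol range here is an assumption, not something from the data dictionary:
print(df2.Alcohol2.isnull().mean())  # share of reviews with missing alcohol
print(((df2.Alcohol2 < 0.05) | (df2.Alcohol2 > 0.25)).sum())  # count outside a plausible ABV range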
df2.taster.value_counts()
30% of the reviews do not have an official taster; they are labeled 'None'.
df2.Category.value_counts()
60% of the reviews are red, and 91% are either red or white.
df2.l5.value_counts()
pd.crosstab(df2.taster,df2.Category)
pd.crosstab(df2.l5,df2.taster)
pd.crosstab(df2.l5,df2.Category)
df2.boxplot(column='points', by=None)
df2.boxplot(column='points', by='Category')
df2.boxplot(column='points', by='l5',rot=45)
df2.boxplot(column='points', by='taster',rot=45)
df2.boxplot(column='price_per_liter', by=None,rot=45)
df2.boxplot(column='price_per_liter', by=None,rot=45 , showfliers=False)
df2.boxplot(column='price_per_liter', by='Category',rot=45)
df2.boxplot(column='price_per_liter', by='Category',rot=45 , showfliers=False)
df2.boxplot(column='price_per_liter', by='l5',rot=45)
df2.boxplot(column='price_per_liter', by='l5',rot=45 , showfliers=False)
df2.boxplot(column='price_per_liter', by='taster',rot=45)
df2.boxplot(column='price_per_liter', by='taster',rot=45, showfliers=False)
sns.set_style("whitegrid")
ax = sns.stripplot(x='points', y='price_per_liter_clip', jitter=True, data=df2, alpha=.25)
sns.set_style("whitegrid")
ax = sns.stripplot(x='points', y='price_per_liter_clip', hue="Category", data=df2, jitter=True, alpha=.25)
sns.set_style("whitegrid")
ax = sns.stripplot(x='points', y='price_per_liter_clip', hue="taster", data=df2, jitter=1, alpha=.25)
ax = sns.violinplot(x='points', y='price_per_liter_clip', data=df2,inner=None, color=".8")
g = sns.factorplot(x="Category", y="points",
col="taster",
data=df2, kind="strip",
jitter=True,
size=4, aspect=.7);
stats.probplot(np.ravel(df2[['points']]), dist="norm", plot=pylab)
pylab.show()
stats.probplot(np.ravel(df2[['price_per_liter']]), dist="norm", plot=pylab)
pylab.show()
stats.probplot(np.ravel(df2[['price_per_liter_clip']]), dist="norm", plot=pylab)
pylab.show()
pca_i = 20
vectorizer = TfidfVectorizer()
#vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'))
vectorizer.fit(df2.description.tolist())
Tfidf_df = vectorizer.transform(df2.description.tolist())
my_normalizer1 = Normalizer()
my_normalizer1.fit(Tfidf_df)
Tfidf_df = my_normalizer1.transform(Tfidf_df)
svd1 = TruncatedSVD(n_components=pca_i, n_iter=7, random_state=42)
svd1.fit(Tfidf_df)
def word_cloud_for_pca(component_i):
    old_component_i = component_i
    component_i = np.flip(np.argsort(svd1.explained_variance_ratio_), 0)[component_i]
    pca_c_i = svd1.components_[component_i]
    high_indexes = np.where(np.abs(pca_c_i) > .03)
    my_features = vectorizer.get_feature_names()
    my_dic = {}
    for x in high_indexes[0]:
        my_dic[my_features[x]] = math.floor(pca_c_i[x] * 1000)
    map_mask = np.array(Image.open("wine2_removed.png"))
    wc = WordCloud(background_color="white", max_words=4000, mask=map_mask)
    wc.generate_from_frequencies(my_dic)
    plt.imshow(wc, interpolation='bilinear')
    variance_explained = round(abs(svd1.explained_variance_ratio_[component_i]) * 100, 2)
    plt.title("principal component " + str(old_component_i + 1) + " - Variance Explained " + str(variance_explained) + "%")
    plt.axis("off")
    plt.show()
for i in range(0, 20):
    word_cloud_for_pca(i)
The code for splitting and transforming the data is placed in a function instead of duplicating the code everywhere.
def return_model_data_points(train_size, pca_i, input_df, keep_vars, save_tools=False):
    # one-hot encode the category and l5 (last appellation component) fields, dropping the first level
    Category_df = pd.get_dummies(input_df[['Category']])
    Category_df = Category_df.drop(Category_df.columns[0], axis=1)
    Category_df = Category_df.reset_index(drop=True)
    l5_df = pd.get_dummies(input_df[['l5']])
    l5_df = l5_df.drop(l5_df.columns[0], axis=1)
    l5_df = l5_df.reset_index(drop=True)
    dummy_df = pd.concat([Category_df, l5_df], axis=1)
    # split before fitting any transformers so the test set stays unseen
    train_test_split_output = train_test_split(input_df, dummy_df, input_df[['points']],
                                               random_state=1, test_size=1 - train_size)
    df_train, df_test, dummy_df_train, dummy_df_test, y_train, y_test = train_test_split_output
    # TF-IDF on the review text, fit on the training descriptions only
    vectorizer = TfidfVectorizer()
    # vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'))
    vectorizer.fit(df_train.description.tolist())
    Tfidf_df_train = vectorizer.transform(df_train.description.tolist())
    Tfidf_df_test = vectorizer.transform(df_test.description.tolist())
    input_df_train = df_train.reset_index(drop=True).copy(deep=True)
    input_df_test = df_test.reset_index(drop=True).copy(deep=True)
    my_normalizer1 = Normalizer()
    my_normalizer1.fit(Tfidf_df_train)
    Tfidf_df_train = my_normalizer1.transform(Tfidf_df_train)
    Tfidf_df_test = my_normalizer1.transform(Tfidf_df_test)
    # reduce the sparse TF-IDF matrix to pca_i dense components
    svd1 = TruncatedSVD(n_components=pca_i, n_iter=7, random_state=42)
    svd1.fit(Tfidf_df_train)
    text_df_train = pd.DataFrame(svd1.transform(Tfidf_df_train)).reset_index(drop=True)
    text_df_test = pd.DataFrame(svd1.transform(Tfidf_df_test)).reset_index(drop=True)
    input_df_train = input_df_train.reset_index(drop=True)
    input_df_test = input_df_test.reset_index(drop=True)
    dummy_df_train = dummy_df_train.reset_index(drop=True)
    dummy_df_test = dummy_df_test.reset_index(drop=True)
    # combine the kept numeric columns, the dummies, and the text components, then scale
    final_input_train = pd.concat([input_df_train[keep_vars], dummy_df_train, text_df_train], axis=1)
    final_input_test = pd.concat([input_df_test[keep_vars], dummy_df_test, text_df_test], axis=1)
    scaler = StandardScaler()
    scaler.fit(final_input_train)
    XTrain = scaler.transform(final_input_train)
    XTest = scaler.transform(final_input_test)
    return (XTrain, XTest, y_train, y_test)
g = sns.jointplot("price_per_liter", "points", data=df2, kind="reg")
g = sns.residplot("price_per_liter", "points", data=df2)
Going to use only the clipped price per liter from here on.
g = sns.jointplot("price_per_liter_clip", "points", data=df2, kind="reg")
g = sns.residplot("price_per_liter_clip", "points", data=df2)
results = sm.OLS(df2[['points']], df2[["price_per_liter_clip"]]).fit()
print(results.summary())
train_size=.2
pca_i=50
XTrain, XTest, y_train, y_test = return_model_data_points(train_size=train_size, pca_i=pca_i, input_df=df2, keep_vars=['price_per_liter_clip'])
reg = linear_model.LinearRegression()
reg.fit(XTrain, y_train)
y_pred_train = reg.predict(XTrain)
y_pred_test = reg.predict(XTest)
print(train_size, "\t", pca_i, "\t",
      metrics.explained_variance_score(y_true=y_train, y_pred=y_pred_train), "\t",
      metrics.explained_variance_score(y_true=y_test, y_pred=y_pred_test))
sns.regplot(x=y_test, y=y_pred_test, fit_reg=False)
## Something is wrong... plain LinearRegression scores essentially zero explained variance on the test set
for i in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 50.0, 100.0]:
    reg = linear_model.Ridge(alpha=i)
    reg.fit(XTrain, y_train)
    y_pred_train = reg.predict(XTrain)
    y_pred_test = reg.predict(XTest)
    print(i, "\t", train_size, "\t", pca_i, "\t",
          metrics.explained_variance_score(y_true=y_train, y_pred=y_pred_train), "\t",
          metrics.explained_variance_score(y_true=y_test, y_pred=y_pred_test))
sns.regplot(x=y_test, y=y_pred_test, fit_reg=False)
for i in [0.01, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1]:
    reg = linear_model.Lasso(alpha=i)
    reg.fit(XTrain, y_train)
    y_pred_train = reg.predict(XTrain)
    y_pred_test = reg.predict(XTest)
    print(i, "\t", train_size, "\t", pca_i, "\t",
          metrics.explained_variance_score(y_true=y_train, y_pred=y_pred_train), "\t",
          metrics.explained_variance_score(y_true=y_test, y_pred=y_pred_test))
for i in [0.01, .1, .25, .5, .75, 1]:
    for j in [0.01, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1]:
        reg = linear_model.ElasticNet(alpha=i, l1_ratio=j)
        reg.fit(XTrain, y_train)
        y_pred_train = reg.predict(XTrain)
        y_pred_test = reg.predict(XTest)
        print(i, "\t", j, "\t", train_size, "\t", pca_i, "\t",
              metrics.explained_variance_score(y_true=y_train, y_pred=y_pred_train), "\t",
              metrics.explained_variance_score(y_true=y_test, y_pred=y_pred_test))
reg = linear_model.BayesianRidge()
reg.fit(XTrain, np.ravel(y_train))
y_pred_train = reg.predict(XTrain)
y_pred_test = reg.predict(XTest)
print(train_size, "\t", pca_i, "\t",
      metrics.explained_variance_score(y_true=y_train, y_pred=y_pred_train), "\t",
      metrics.explained_variance_score(y_true=y_test, y_pred=y_pred_test))
sns.regplot(x=y_test, y=y_pred_test, fit_reg=False)
reg = svm.SVR()
reg.fit(XTrain, np.ravel(y_train))
y_pred_train = reg.predict(XTrain)
y_pred_test = reg.predict(XTest)
print(train_size, "\t", pca_i, "\t",
      metrics.explained_variance_score(y_true=y_train, y_pred=y_pred_train), "\t",
      metrics.explained_variance_score(y_true=y_test, y_pred=y_pred_test))
sns.regplot(x=y_test, y=y_pred_test, fit_reg=False)
reg = KNeighborsRegressor(n_neighbors=5)
reg.fit(XTrain, np.ravel(y_train))
y_pred_train = reg.predict(XTrain)
y_pred_test = reg.predict(XTest)
print(train_size, "\t", pca_i, "\t",
      metrics.explained_variance_score(y_true=y_train, y_pred=y_pred_train), "\t",
      metrics.explained_variance_score(y_true=y_test, y_pred=y_pred_test))
sns.regplot(x=y_test, y=y_pred_test, fit_reg=False)
for i in range(1, 15):
    reg = DecisionTreeRegressor(max_depth=i)
    reg.fit(XTrain, np.ravel(y_train))
    y_pred_train = reg.predict(XTrain)
    y_pred_test = reg.predict(XTest)
    print(i, "\t", train_size, "\t", pca_i, "\t",
          metrics.explained_variance_score(y_true=y_train, y_pred=y_pred_train), "\t",
          metrics.explained_variance_score(y_true=y_test, y_pred=y_pred_test))
for i in range(1, 21):
    reg = ExtraTreesRegressor(max_depth=i)
    reg.fit(XTrain, np.ravel(y_train))
    y_pred_train = reg.predict(XTrain)
    y_pred_test = reg.predict(XTest)
    print(i, "\t", train_size, "\t", pca_i, "\t",
          metrics.explained_variance_score(y_true=y_train, y_pred=y_pred_train), "\t",
          metrics.explained_variance_score(y_true=y_test, y_pred=y_pred_test))
sns.regplot(x=y_test, y=y_pred_test, fit_reg=False)
for i in range(1, 21):
    reg = RandomForestRegressor(max_depth=i)
    reg.fit(XTrain, np.ravel(y_train))
    y_pred_train = reg.predict(XTrain)
    y_pred_test = reg.predict(XTest)
    print(i, "\t", train_size, "\t", pca_i, "\t",
          metrics.explained_variance_score(y_true=y_train, y_pred=y_pred_train), "\t",
          metrics.explained_variance_score(y_true=y_test, y_pred=y_pred_test))
sns.regplot(x=y_test, y=y_pred_test, fit_reg=False)
reg = BaggingRegressor()
reg.fit(XTrain, np.ravel(y_train))
y_pred_train = reg.predict(XTrain)
y_pred_test = reg.predict(XTest)
print(train_size, "\t", pca_i, "\t",
      metrics.explained_variance_score(y_true=y_train, y_pred=y_pred_train), "\t",
      metrics.explained_variance_score(y_true=y_test, y_pred=y_pred_test))
sns.regplot(x=y_test, y=y_pred_test, fit_reg=False)
reg = AdaBoostRegressor()
reg.fit(XTrain, np.ravel(y_train))
y_pred_train = reg.predict(XTrain)
y_pred_test = reg.predict(XTest)
print(train_size, "\t", pca_i, "\t",
      metrics.explained_variance_score(y_true=y_train, y_pred=y_pred_train), "\t",
      metrics.explained_variance_score(y_true=y_test, y_pred=y_pred_test))
sns.regplot(x=y_test, y=y_pred_test, fit_reg=False)
reg = GradientBoostingRegressor()
reg.fit(XTrain, np.ravel(y_train))
y_pred_train = reg.predict(XTrain)
y_pred_test = reg.predict(XTest)
print(train_size, "\t", pca_i, "\t",
      metrics.explained_variance_score(y_true=y_train, y_pred=y_pred_train), "\t",
      metrics.explained_variance_score(y_true=y_test, y_pred=y_pred_test))
sns.regplot(x=y_test, y=y_pred_test, fit_reg=False)
reg = MLPRegressor()
reg.fit(XTrain, np.ravel(y_train))
y_pred_train = reg.predict(XTrain)
y_pred_test = reg.predict(XTest)
print(train_size, "\t", pca_i, "\t",
      metrics.explained_variance_score(y_true=y_train, y_pred=y_pred_train), "\t",
      metrics.explained_variance_score(y_true=y_test, y_pred=y_pred_test))
sns.regplot(x=y_test, y=y_pred_test, fit_reg=False)
# scale the target as well, then invert the scaling on the predictions
scalery = StandardScaler()
scalery.fit(y_train)
reg = MLPRegressor(activation="logistic")
reg.fit(XTrain, np.ravel(scalery.transform(y_train)))
y_pred_train = scalery.inverse_transform(reg.predict(XTrain))
y_pred_test = scalery.inverse_transform(reg.predict(XTest))
print(train_size, "\t", pca_i, "\t",
      metrics.explained_variance_score(y_true=y_train, y_pred=y_pred_train), "\t",
      metrics.explained_variance_score(y_true=y_test, y_pred=y_pred_test))
sns.regplot(x=y_test, y=y_pred_test, fit_reg=False)
%%javascript
// Since the div is appended later, sometimes there are multiple divs.
$("#container0").remove();
// Make the div to contain the chart.
element.append('<div id="container0" style="min-width: 310px; height: 400px; margin: 0 auto"></div>');
// Require highcharts and make the chart.
require(['highcharts_exports'], function(Highcharts) {
    $('#container0').highcharts({
        title: {
            text: 'Regression on 20% Train'
        },
        plotOptions: {
            scatter: {
                dataLabels: {
                    format: "{point.name}",
                    enabled: true
                },
                enableMouseTracking: false
            }
        },
        yAxis: {
            title: {
                text: 'test'
            }
        },
        xAxis: {
            title: {
                text: 'train'
            }
        },
        legend: {
            enabled: false
        },
series: [{name:'LinearRegression',data:[[0.57693157242,0]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'Ridge a-0.1',data:[[0.576935992332,0.573216499941]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'Ridge a-0.5',data:[[0.576935992272,0.573216547326]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'Ridge a-1',data:[[0.576935992086,0.573216606453]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'Ridge a-2',data:[[0.576935991342,0.573216724366]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'Ridge a-5',data:[[0.576935986135,0.573217075359]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'Ridge a-10',data:[[0.576935967562,0.573217651208]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'Ridge a-50',data:[[0.576935377753,0.573221848744]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'Ridge a-100',data:[[0.5769335556,0.57322608157]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'Lasso a-0.01',data:[[0.575986066361,0.572419294426]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'Lasso a-0.1',data:[[0.534540295045,0.532605140083]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'Lasso a-0.2',data:[[0.465400718504,0.465050013838]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'Lasso a-0.3',data:[[0.41204104232,0.412654827311]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'Lasso a-0.4',data:[[0.375154557165,0.375882765836]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'Lasso a-0.5',data:[[0.341375800417,0.342083333886]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'Lasso a-0.6',data:[[0.31349541044,0.314543717142]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'Lasso a-0.7',data:[[0.296654285517,0.297816438112]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'Lasso a-0.8',data:[[0.281645241169,0.282813967056]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'Lasso a-0.9',data:[[0.264634990909,0.265790100838]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'Lasso a-1',data:[[0.245623534736,0.246744839458]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ElasticNet',data:[[0.576893590782,0.573227722896]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'BayesianRidge',data:[[0.576933697755,0.57322586024]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'Epsilon-Support Vector Regression',data:[[0.70426790027,0.656844115156]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'KNeighborsRegressor',data:[[0.622527718053,0.432821996652]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'DecisionTreeRegressor md-1',data:[[0.251932279139,0.249152937977]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'DecisionTreeRegressor md-2',data:[[0.341400724192,0.338789311876]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'DecisionTreeRegressor md-3',data:[[0.373823694253,0.367008755166]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'DecisionTreeRegressor md-4',data:[[0.408825435999,0.398285452377]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'DecisionTreeRegressor md-5',data:[[0.436963415561,0.418303149015]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'DecisionTreeRegressor md-6',data:[[0.464584350532,0.435764443706]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'DecisionTreeRegressor md-7',data:[[0.492742686219,0.444242915314]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'DecisionTreeRegressor md-8',data:[[0.524234486312,0.445643770674]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'DecisionTreeRegressor md-9',data:[[0.561980925478,0.434528450405]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'DecisionTreeRegressor md-10',data:[[0.60646717857,0.409553241429]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'DecisionTreeRegressor md-11',data:[[0.659616755222,0.377614687389]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'DecisionTreeRegressor md-12',data:[[0.716702294151,0.331247997956]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'DecisionTreeRegressor md-13',data:[[0.773246101054,0.282921586019]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'DecisionTreeRegressor md-14',data:[[0.826168052945,0.232335639987]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ExtraTreesRegressor md-1',data:[[0.276218453289,0.277302859305]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ExtraTreesRegressor md-2',data:[[0.334372146946,0.334800958001]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ExtraTreesRegressor md-3',data:[[0.341710864848,0.342158171534]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ExtraTreesRegressor md-4',data:[[0.395177437638,0.393018122047]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ExtraTreesRegressor md-5',data:[[0.418316846897,0.412944158794]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ExtraTreesRegressor md-6',data:[[0.433999979819,0.425637676073]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ExtraTreesRegressor md-7',data:[[0.47331644118,0.457239340214]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ExtraTreesRegressor md-8',data:[[0.503824204823,0.47544996857]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ExtraTreesRegressor md-9',data:[[0.536455051507,0.487497203427]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ExtraTreesRegressor md-10',data:[[0.573390642294,0.502803734809]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ExtraTreesRegressor md-11',data:[[0.614982125512,0.513066632459]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ExtraTreesRegressor md-12',data:[[0.648900220316,0.517159108847]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ExtraTreesRegressor md-13',data:[[0.705539937704,0.525858012936]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ExtraTreesRegressor md-14',data:[[0.75273761587,0.530661565234]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ExtraTreesRegressor md-15',data:[[0.824681855177,0.536727670344]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ExtraTreesRegressor md-16',data:[[0.839869737177,0.534983660466]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ExtraTreesRegressor md-17',data:[[0.880005192681,0.537550346258]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ExtraTreesRegressor md-18',data:[[0.917548922837,0.534934652633]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ExtraTreesRegressor md-19',data:[[0.947038604116,0.535134292795]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ExtraTreesRegressor md-20',data:[[0.961746205899,0.535312367497]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'RandomForestRegressor md-1',data:[[0.251932190003,0.249150563766]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'RandomForestRegressor md-2',data:[[0.345690387606,0.343724499419]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'RandomForestRegressor md-3',data:[[0.385052276977,0.379435051034]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'RandomForestRegressor md-4',data:[[0.425255082027,0.415273738603]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'RandomForestRegressor md-5',data:[[0.462670423378,0.446221157983]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'RandomForestRegressor md-6',data:[[0.493356006109,0.467804473027]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'RandomForestRegressor md-7',data:[[0.529782851151,0.490420897987]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'RandomForestRegressor md-8',data:[[0.567668710916,0.505424908231]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'RandomForestRegressor md-9',data:[[0.611066100862,0.517659696631]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'RandomForestRegressor md-10',data:[[0.660582133165,0.524881316839]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'RandomForestRegressor md-11',data:[[0.704588271716,0.527043811798]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'RandomForestRegressor md-12',data:[[0.751068808656,0.531196757332]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'RandomForestRegressor md-13',data:[[0.792973665758,0.530265449309]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'RandomForestRegressor md-14',data:[[0.832133500033,0.527065076469]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'RandomForestRegressor md-15',data:[[0.852528706076,0.526638200629]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'RandomForestRegressor md-16',data:[[0.873256080887,0.528913301794]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'RandomForestRegressor md-17',data:[[0.887194127046,0.524580152187]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'RandomForestRegressor md-18',data:[[0.897051010579,0.521906266137]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'RandomForestRegressor md-19',data:[[0.903416142359,0.520207740243]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'RandomForestRegressor md-20',data:[[0.907977793058,0.52126633392]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'BaggingRegressor',data:[[0.913977685751,0.52026010201]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'AdaBoostRegressor',data:[[0.472260117792,0.462546681236]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'GradientBoostingRegressor',data:[[0.596213044953,0.570234555362]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'MLPRegressor',data:[[0.672571071177,0.586911674866]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'MLPRegressor Scaled',data:[[0.774730419101,0.604890555792]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}}]
});
});
train_size = .05
pca_i = 100
XTrain, XTest, y_train, y_test = return_model_data_points(train_size=train_size, pca_i=pca_i, input_df=df2, keep_vars=['price_per_liter_clip'])
loss_list = ['ls', 'lad', 'huber', 'quantile']
learning_rate_list = [0.1, 0.15, 0.2, .25, .5, .75]
max_depth_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]
for loss_i in loss_list:
    for learning_rate_i in learning_rate_list:
        for max_depth_i in max_depth_list:
            reg = GradientBoostingRegressor(loss=loss_i, learning_rate=learning_rate_i, max_depth=max_depth_i)
            reg.fit(XTrain, np.ravel(y_train))
            y_pred_train = reg.predict(XTrain)
            y_pred_test = reg.predict(XTest)
            print(loss_i, "\t", learning_rate_i, "\t", max_depth_i, "\t", train_size, "\t", pca_i, "\t",
                  metrics.explained_variance_score(y_true=y_train, y_pred=y_pred_train), "\t",
                  metrics.explained_variance_score(y_true=y_test, y_pred=y_pred_test))
%%javascript
// Since the div is appended later, sometimes there are multiple divs.
$("#container1").remove();
// Make the div to contain the chart.
element.append('<div id="container1" style="min-width: 310px; height: 400px; margin: 0 auto"></div>');
// Require highcharts and make the chart.
require(['highcharts_exports'], function(Highcharts) {
    $('#container1').highcharts({
        title: {
            text: 'GradientBoostingRegressor'
        },
        plotOptions: {
            scatter: {
                dataLabels: {
                    format: "{point.name}",
                    enabled: true
                },
                enableMouseTracking: false
            }
        },
        yAxis: {
            title: {
                text: 'test'
            }
        },
        xAxis: {
            title: {
                text: 'train'
            }
        },
        legend: {
            enabled: false
        },
series: [{name:'ls - LR0.1 - MD1',data:[[0.483195481072,0.474670631554]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.1 - MD2',data:[[0.573312148488,0.542484573263]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.1 - MD3',data:[[0.639864359097,0.572574474065]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.1 - MD4',data:[[0.711570113162,0.591042550157]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.1 - MD5',data:[[0.788853944671,0.597958416133]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.1 - MD6',data:[[0.864170521402,0.599622857668]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.1 - MD7',data:[[0.930424766073,0.596030158475]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.1 - MD8',data:[[0.968269131224,0.587936201096]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.1 - MD9',data:[[0.991099822719,0.57738851849]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.15 - MD1',data:[[0.519173600994,0.506530808563]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.15 - MD2',data:[[0.606877851635,0.564962613525]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.15 - MD3',data:[[0.678489135064,0.589166602604]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.15 - MD4',data:[[0.754201228068,0.59949633042]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.15 - MD5',data:[[0.830949703617,0.601950337646]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.15 - MD6',data:[[0.897615111022,0.5980906965]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.15 - MD7',data:[[0.955596525332,0.590033544289]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.15 - MD8',data:[[0.983457954212,0.579793165364]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.15 - MD9',data:[[0.997128038571,0.567213632333]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.2 - MD1',data:[[0.542924522414,0.526893958616]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.2 - MD2',data:[[0.629785050773,0.578424459603]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.2 - MD3',data:[[0.703825126206,0.595344980637]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.2 - MD4',data:[[0.78146733493,0.601375735172]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.2 - MD5',data:[[0.855945803152,0.595180303543]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.2 - MD6',data:[[0.922226987158,0.590351006137]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.2 - MD7',data:[[0.969728483677,0.580146372507]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.2 - MD8',data:[[0.990807379293,0.568193397591]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.2 - MD9',data:[[0.998564304007,0.552465567057]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.25 - MD1',data:[[0.559405555776,0.540754120087]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.25 - MD2',data:[[0.64650911693,0.586309813089]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.25 - MD3',data:[[0.720777611522,0.598429031394]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.25 - MD4',data:[[0.800040485961,0.59607520576]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.25 - MD5',data:[[0.877771541174,0.590470901025]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.25 - MD6',data:[[0.939282385401,0.578703560005]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.25 - MD7',data:[[0.977229088685,0.567272206962]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.25 - MD8',data:[[0.995125947476,0.554312828629]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.25 - MD9',data:[[0.99955848554,0.540051802073]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.5 - MD1',data:[[0.597612520156,0.563577973277]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.5 - MD2',data:[[0.681079881294,0.582230505545]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.5 - MD3',data:[[0.761461322512,0.57405556384]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.5 - MD4',data:[[0.849517402345,0.544959603289]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.5 - MD5',data:[[0.928608187651,0.514958440526]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.5 - MD6',data:[[0.977741120904,0.492438416477]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.5 - MD7',data:[[0.995730500572,0.474708972368]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.5 - MD8',data:[[0.999773201584,0.461162568518]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.5 - MD9',data:[[0.999991519654,0.445805099294]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.75 - MD1',data:[[0.606852019672,0.561684732504]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.75 - MD2',data:[[0.691586078645,0.56485422784]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.75 - MD3',data:[[0.778952150523,0.512579454393]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.75 - MD4',data:[[0.877017801221,0.462789396318]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.75 - MD5',data:[[0.954800912462,0.400535774188]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.75 - MD6',data:[[0.991349439505,0.359251139115]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.75 - MD7',data:[[0.999104622718,0.338249867878]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.75 - MD8',data:[[0.999969242334,0.320387508687]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'ls - LR0.75 - MD9',data:[[0.999999723331,0.309320133057]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'lad - LR0.1 - MD1',data:[[0.470147244234,0.463792116696]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.1 - MD2',data:[[0.552889121664,0.53014259963]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.1 - MD3',data:[[0.61240896779,0.56339443216]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.1 - MD4',data:[[0.660189367638,0.574401693833]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.1 - MD5',data:[[0.708718359382,0.581121909447]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.1 - MD6',data:[[0.753550867917,0.583106521626]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.1 - MD7',data:[[0.79126941866,0.577200902192]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.1 - MD8',data:[[0.827173658816,0.572023240758]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.1 - MD9',data:[[0.86302982277,0.567393604828]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.15 - MD1',data:[[0.50531548943,0.495426880447]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.15 - MD2',data:[[0.584173448482,0.553337009746]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.15 - MD3',data:[[0.638287211259,0.575925501813]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.15 - MD4',data:[[0.686871547205,0.583072692261]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.15 - MD5',data:[[0.731838475214,0.584479040752]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.15 - MD6',data:[[0.777482549972,0.583185647494]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.15 - MD7',data:[[0.815524920058,0.579326200172]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.15 - MD8',data:[[0.8474383006,0.571369892454]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.15 - MD9',data:[[0.875936489803,0.56192039182]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.2 - MD1',data:[[0.526634366283,0.513101736847]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.2 - MD2',data:[[0.601853945648,0.564377943386]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.2 - MD3',data:[[0.657019346919,0.583643442889]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.2 - MD4',data:[[0.699606843876,0.583923832298]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.2 - MD5',data:[[0.746909026349,0.583862672448]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.2 - MD6',data:[[0.784414301077,0.580506250823]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.2 - MD7',data:[[0.825115258714,0.575212067096]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.2 - MD8',data:[[0.855249257636,0.563140362391]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.2 - MD9',data:[[0.887213285797,0.555947943111]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.25 - MD1',data:[[0.544964643176,0.529631214942]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.25 - MD2',data:[[0.614969012229,0.571436672092]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.25 - MD3',data:[[0.666229813029,0.584058564796]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.25 - MD4',data:[[0.711360297902,0.581802582589]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.25 - MD5',data:[[0.752913444559,0.579061446238]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.25 - MD6',data:[[0.784434659618,0.5741186018]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.25 - MD7',data:[[0.824056919477,0.567506882236]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.25 - MD8',data:[[0.861304393256,0.556674638704]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.25 - MD9',data:[[0.886159913456,0.548212034485]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.5 - MD1',data:[[0.541070180523,0.520994250388]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.5 - MD2',data:[[0.647893425537,0.577629536759]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.5 - MD3',data:[[0.688345597803,0.57000271842]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.5 - MD4',data:[[0.733897399816,0.555695882275]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.5 - MD5',data:[[0.769537623812,0.547997992185]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.5 - MD6',data:[[0.807249667709,0.533532324904]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.5 - MD7',data:[[0.835757890632,0.518192532309]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.5 - MD8',data:[[0.87284285622,0.498597683566]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.5 - MD9',data:[[0.896917529415,0.484213472853]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.75 - MD1',data:[[0.596317283023,0.556183736385]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.75 - MD2',data:[[0.652990330282,0.563488121362]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.75 - MD3',data:[[0.698452223812,0.54571542435]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.75 - MD4',data:[[0.731143277028,0.512059059729]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.75 - MD5',data:[[0.782148045705,0.486256502673]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.75 - MD6',data:[[0.811386651861,0.45511620425]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.75 - MD7',data:[[0.843396315063,0.439166826033]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.75 - MD8',data:[[0.879474561023,0.412184112483]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'lad - LR0.75 - MD9',data:[[0.897729340138,0.390995879317]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'huber - LR0.1 - MD1',data:[[0.484702820443,0.476256697828]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.1 - MD2',data:[[0.572787279837,0.543573328207]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.1 - MD3',data:[[0.636429581541,0.572089154825]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.1 - MD4',data:[[0.705626481701,0.58987898467]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.1 - MD5',data:[[0.782201997196,0.597765022006]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.1 - MD6',data:[[0.852819556298,0.598969575974]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.1 - MD7',data:[[0.914158427534,0.595389577059]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.1 - MD8',data:[[0.955300038064,0.586869064995]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.1 - MD9',data:[[0.980411857223,0.577367233848]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.15 - MD1',data:[[0.519941627966,0.507897862138]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.15 - MD2',data:[[0.607327118468,0.566325242824]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.15 - MD3',data:[[0.676699207771,0.589211731403]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.15 - MD4',data:[[0.746618168939,0.59665550598]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.15 - MD5',data:[[0.821272306105,0.60081752492]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.15 - MD6',data:[[0.885438298961,0.59769842929]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.15 - MD7',data:[[0.937023503168,0.590278029409]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.15 - MD8',data:[[0.970426747868,0.581065009216]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.15 - MD9',data:[[0.986964960927,0.566564430631]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.2 - MD1',data:[[0.54256712556,0.528182412243]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.2 - MD2',data:[[0.628663643289,0.578246036712]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.2 - MD3',data:[[0.699652365591,0.596663826998]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.2 - MD4',data:[[0.770187654079,0.598169269974]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.2 - MD5',data:[[0.844262425768,0.596027464119]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.2 - MD6',data:[[0.905089969876,0.589218799269]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.2 - MD7',data:[[0.954889724486,0.579975597426]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.2 - MD8',data:[[0.979431370891,0.567294999257]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.2 - MD9',data:[[0.991727830882,0.554679291695]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.25 - MD1',data:[[0.559720657401,0.541083448514]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.25 - MD2',data:[[0.644148356014,0.585851345331]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.25 - MD3',data:[[0.717523746548,0.598188059469]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.25 - MD4',data:[[0.787801596543,0.595209847844]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.25 - MD5',data:[[0.863754625038,0.589298549984]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.25 - MD6',data:[[0.921083919934,0.579992048102]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.25 - MD7',data:[[0.965034691734,0.568629367988]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.25 - MD8',data:[[0.983509172086,0.556772124846]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.25 - MD9',data:[[0.993460904396,0.541221352857]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.5 - MD1',data:[[0.595676165738,0.564154299715]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.5 - MD2',data:[[0.678149425592,0.583144684586]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.5 - MD3',data:[[0.758631929306,0.575104151827]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.5 - MD4',data:[[0.835414885511,0.550314716196]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.5 - MD5',data:[[0.909606389212,0.524410034328]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.5 - MD6',data:[[0.963297462163,0.502234861715]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.5 - MD7',data:[[0.984285631928,0.484700504454]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.5 - MD8',data:[[0.992078885691,0.47157913527]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.5 - MD9',data:[[0.996951070307,0.45165522074]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.75 - MD1',data:[[0.606945820357,0.559035997392]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.75 - MD2',data:[[0.689569079414,0.558096043278]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.75 - MD3',data:[[0.774867188469,0.520734366421]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.75 - MD4',data:[[0.859730039591,0.46570304241]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.75 - MD5',data:[[0.936316030797,0.410348470649]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.75 - MD6',data:[[0.977501177396,0.381674699951]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.75 - MD7',data:[[0.987516730978,0.365960005809]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.75 - MD8',data:[[0.992953428422,0.344039423157]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'huber - LR0.75 - MD9',data:[[0.997207734394,0.320933042798]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'quantile - LR0.1 - MD1',data:[[0.423238318883,0.424403863531]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.1 - MD2',data:[[0.497303156164,0.484234218165]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.1 - MD3',data:[[0.533069482006,0.507447726265]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.1 - MD4',data:[[0.55817468672,0.516466130494]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.1 - MD5',data:[[0.57573691731,0.517066136644]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.1 - MD6',data:[[0.587564188439,0.512729279197]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.1 - MD7',data:[[0.593097890283,0.501817690723]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.1 - MD8',data:[[0.610329728067,0.500073852524]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.1 - MD9',data:[[0.608361601864,0.487795052256]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.15 - MD1',data:[[0.45620187398,0.454324648218]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.15 - MD2',data:[[0.520808531602,0.502712682075]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.15 - MD3',data:[[0.556432890711,0.52328556955]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.15 - MD4',data:[[0.575624901586,0.524978122751]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.15 - MD5',data:[[0.589029428191,0.52458893886]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.15 - MD6',data:[[0.601383893006,0.517931405175]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.15 - MD7',data:[[0.620786448514,0.515524550746]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.15 - MD8',data:[[0.623015806904,0.502629595446]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.15 - MD9',data:[[0.62340850872,0.491110477043]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.2 - MD1',data:[[0.476672151201,0.471948670592]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.2 - MD2',data:[[0.537850554478,0.514182800628]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.2 - MD3',data:[[0.561246711192,0.524261876705]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.2 - MD4',data:[[0.593101129083,0.533834121163]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.2 - MD5',data:[[0.605721657923,0.530259911835]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.2 - MD6',data:[[0.616357172131,0.524147428767]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.2 - MD7',data:[[0.627130242245,0.516334188837]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.2 - MD8',data:[[0.64259263376,0.510422616685]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.2 - MD9',data:[[0.645490438547,0.501146966844]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.25 - MD1',data:[[0.491012773207,0.485374274742]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.25 - MD2',data:[[0.54531751435,0.519866482481]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.25 - MD3',data:[[0.575786948041,0.532538708395]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.25 - MD4',data:[[0.597126659958,0.530261221923]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.25 - MD5',data:[[0.605907437887,0.528458734123]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.25 - MD6',data:[[0.628726805734,0.52590127478]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.25 - MD7',data:[[0.639785679312,0.519928756854]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.25 - MD8',data:[[0.638370522989,0.504382082233]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.25 - MD9',data:[[0.654651683521,0.498127846258]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.5 - MD1',data:[[0.532728180733,0.517173859921]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.5 - MD2',data:[[0.55852769516,0.522711256208]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.5 - MD3',data:[[0.58673388658,0.524853661187]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.5 - MD4',data:[[0.619461519386,0.523341562462]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.5 - MD5',data:[[0.630153104742,0.526933204881]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.5 - MD6',data:[[0.637763280962,0.519793379464]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.5 - MD7',data:[[0.649569346495,0.500441390607]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.5 - MD8',data:[[0.662254206386,0.486503893056]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.5 - MD9',data:[[0.664303147619,0.48152678482]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.75 - MD1',data:[[0.53723249853,0.517791488697]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.75 - MD2',data:[[0.568583975795,0.517445645223]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.75 - MD3',data:[[0.585558293366,0.509988370252]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.75 - MD4',data:[[0.598506842354,0.49850085886]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.75 - MD5',data:[[0.618357956888,0.492355617663]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.75 - MD6',data:[[0.622885058303,0.474975221008]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.75 - MD7',data:[[0.656488202042,0.473372144055]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.75 - MD8',data:[[0.660201407577,0.451091621879]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'quantile - LR0.75 - MD9',data:[[0.672359070201,0.445650099879]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}}]
});
});
reg = GradientBoostingRegressor()
reg.fit(XTrain, np.ravel(y_train))
y_pred_train = reg.predict(XTrain)
y_pred_test = reg.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.explained_variance_score(y_true = y_train, y_pred = y_pred_train),"\t",
metrics.explained_variance_score(y_true = y_test, y_pred = y_pred_test))
scalery = StandardScaler()
scalery.fit(y_train)
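# Targets are standardized before fitting so the MLP trains on zero-mean,
# unit-variance values; predictions are mapped back with inverse_transform.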
activation_list= ["identity","logistic","tanh","relu"]
solver_list = ['lbfgs','sgd','adam']
for activation_i in activation_list:
for solver_i in solver_list:
reg = MLPRegressor(activation=activation_i, solver=solver_i)
reg.fit(XTrain, np.ravel(scalery.transform(y_train)))
y_pred_train = scalery.inverse_transform(reg.predict(XTrain))
y_pred_test = scalery.inverse_transform(reg.predict(XTest))
print(activation_i,"\t",solver_i,"\t",train_size,"\t",pca_i, "\t",
metrics.explained_variance_score(y_true = y_train, y_pred = y_pred_train),"\t",
metrics.explained_variance_score(y_true = y_test, y_pred = y_pred_test))
%%javascript
// Remove any existing chart div; re-running this cell would otherwise append duplicates.
$("#container2").remove();
// Make the div that will contain the chart.
element.append('<div id="container2" style="min-width: 310px; height: 400px; margin: 0 auto"></div>');
// Require Highcharts and draw the chart.
require(['highcharts_exports'], function(Highcharts) {
$('#container2').highcharts({
title: {
text: 'MLPRegressor'
},
plotOptions: {
scatter: {
dataLabels: {
format: "{point.name}",
enabled: true
},
enableMouseTracking: false
}
},
yAxis: {
title: {
text: 'test'
}
},xAxis: {
title: {
text: 'train'
}
},
legend: {
enabled: false
},
series: [{name:'identity - lbfgs ',data:[[0.607154096275,0.600211284445]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'identity - sgd ',data:[[0.606660310127,0.599485790112]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'identity - adam ',data:[[0.570139742024,0.564111497658]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'logistic - lbfgs ',data:[[0.909963398761,0.459365415587]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'logistic - sgd ',data:[[0.610245249329,0.60308284211]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'logistic - adam ',data:[[0.981950120395,0.400868557658]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'tanh - lbfgs ',data:[[0.990816806993,0.0976663334808]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'tanh - sgd ',data:[[0.6966214225,0.58945632735]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'tanh - adam ',data:[[0.991294687258,0.159512907298]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'relu - lbfgs ',data:[[0.937123552089,0.367925087958]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'relu - sgd ',data:[[0.732436830781,0.604889248315]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'relu - adam ',data:[[0.83767932459,0.557892075499]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}}]
});
});
scalery = StandardScaler()
scalery.fit(y_train)
reg = MLPRegressor()
reg.fit(XTrain, np.ravel(scalery.transform(y_train)))
y_pred_train = scalery.inverse_transform(reg.predict(XTrain))
y_pred_test = scalery.inverse_transform(reg.predict(XTest))
print(train_size,"\t",pca_i, "\t",
metrics.explained_variance_score(y_true = y_train, y_pred = y_pred_train),"\t",
metrics.explained_variance_score(y_true = y_test, y_pred = y_pred_test))
from sklearn.preprocessing import MinMaxScaler
def return_model_data_points(train_size, pca_i, input_df, keep_vars, save_tools = False):
taster_df = pd.get_dummies(input_df[['taster']])
taster_df = taster_df.drop(taster_df.columns[0], axis=1)
taster_df = taster_df.reset_index(drop=True)
Category_df = pd.get_dummies(input_df[['Category']])
Category_df = Category_df.drop(Category_df.columns[0], axis=1)
Category_df = Category_df.reset_index(drop=True)
l5_df = pd.get_dummies(input_df[['l5']])
l5_df = l5_df.drop(l5_df.columns[0], axis=1)
l5_df = l5_df.reset_index(drop=True)
dummy_df = pd.concat([taster_df,Category_df,l5_df], axis=1)
train_test_split_output = train_test_split(input_df, dummy_df, input_df[['points']] , random_state=1, test_size=1-train_size)
df_train, df_test, dummy_df_train, dummy_df_test, y1_train, y1_test = train_test_split_output
scaler_points = MinMaxScaler(feature_range=(0.01, .99))
scaler_points.fit(y1_train)
if save_tools:
joblib.dump(scaler_points, 'scaler_points.pkl')
y1_train_s = scaler_points.transform(y1_train)
y1_test_s = scaler_points.transform(y1_test)
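# The TF-IDF -> Normalizer -> TruncatedSVD chain below is the usual latent
# semantic analysis (LSA) stack; pca_i sets how many text components are kept.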
vectorizer = TfidfVectorizer()
#vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'))
vectorizer.fit(df_train.description.tolist())
if save_tools:
joblib.dump(vectorizer, 'vectorizer.pkl')
Tfidf_df_train = vectorizer.transform(df_train.description.tolist())
Tfidf_df_test = vectorizer.transform(df_test.description.tolist())
input_df_train = df_train.reset_index(drop=True).copy(deep=True)
input_df_test = df_test.reset_index(drop=True).copy(deep=True)
my_normalizer1 = Normalizer()
my_normalizer1.fit(Tfidf_df_train)
if save_tools:
joblib.dump(my_normalizer1, 'normalizer1.pkl')
Tfidf_df_train = my_normalizer1.transform(Tfidf_df_train)
Tfidf_df_test = my_normalizer1.transform(Tfidf_df_test)
svd1 = TruncatedSVD(n_components=pca_i, n_iter=7, random_state=42)
svd1.fit(Tfidf_df_train)
if save_tools:
joblib.dump(svd1, 'svd1.pkl')
text_df_train = pd.DataFrame(svd1.transform(Tfidf_df_train))
text_df_train = text_df_train.reset_index(drop=True)
text_df_test = pd.DataFrame(svd1.transform(Tfidf_df_test))
text_df_test = text_df_test.reset_index(drop=True)
input_df_train = input_df_train.reset_index(drop=True)
input_df_test = input_df_test.reset_index(drop=True)
dummy_df_train = dummy_df_train.reset_index(drop=True)
dummy_df_test = dummy_df_test.reset_index(drop=True)
text_df_train = text_df_train.reset_index(drop=True)
text_df_test = text_df_test.reset_index(drop=True)
final_input_train = pd.concat([input_df_train[keep_vars],dummy_df_train,text_df_train], axis=1)
final_input_test = pd.concat([input_df_test[keep_vars], dummy_df_test,text_df_test], axis=1)
scaler = StandardScaler()
scaler.fit(final_input_train)
if save_tools:
joblib.dump(scaler, 'scaler.pkl')
XTrain = scaler.transform(final_input_train)
XTest = scaler.transform(final_input_test)
return(XTrain, XTest,
y1_train, y1_test,
y1_train_s, y1_test_s,
vectorizer,my_normalizer1,svd1,scaler,scaler_points)
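Because save_tools=True dumps the fitted transformers with joblib, a new review's text can be projected into the same feature space later without refitting. A sketch (the review string is made up; the dummy columns and keep_vars would still have to be assembled in the same column order before scaler.transform):
import joblib
vectorizer = joblib.load('vectorizer.pkl')
normalizer1 = joblib.load('normalizer1.pkl')
svd1 = joblib.load('svd1.pkl')
new_text = ["Bright cherry fruit, firm tannins, and a long finish."]  # hypothetical review
text_components = svd1.transform(normalizer1.transform(vectorizer.transform(new_text)))
print(text_components.shape)  # (1, pca_i)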
class neuralNetwork:
def __init__(self, input_nodes, hidden_nodes, output_nodes, learning_rate,
weights_input_to_hidden,weights_hidden_to_output,scaler_points):
self.input_nodes = input_nodes
self.hidden_nodes = hidden_nodes
self.output_nodes = output_nodes
self.weights_input_to_hidden = weights_input_to_hidden
self.weights_hidden_to_output = weights_hidden_to_output
self.learning_rate = learning_rate
self.y1_n = 1
self.e = 0
self.scaler_points = scaler_points
pass
def activation_function(self,x):
return sc.special.expit(x)
def get_e(self):
return self.e
def train(self, inputs_list, targets_list):
inputs = np.array(inputs_list, ndmin=2).T
targets = np.array(targets_list, ndmin=2).T
hidden_inputs = np.dot(self.weights_input_to_hidden, inputs)
hidden_outputs = self.activation_function(hidden_inputs)
final_inputs = np.dot(self.weights_hidden_to_output, hidden_outputs)
final_outputs = self.activation_function(final_inputs)
output_errors = targets - final_outputs
hidden_errors = np.dot(self.weights_hidden_to_output.T, output_errors)
self.weights_hidden_to_output += self.learning_rate * np.dot(
(output_errors * final_outputs * (1.0 - final_outputs)), np.transpose(hidden_outputs))
self.weights_input_to_hidden += self.learning_rate * np.dot(
(hidden_errors * hidden_outputs * (1.0 - hidden_outputs)), np.transpose(inputs))
self.e += 1
pass
def train_df(self, XTrain, y1_train):
for x, y1 in zip(XTrain, y1_train):
inputs = np.asfarray(x)
self.train(inputs, y1)
pass
pass
def query(self, inputs_list):
inputs = np.array(inputs_list, ndmin=2).T
hidden_inputs = np.dot(self.weights_input_to_hidden, inputs)
hidden_outputs = self.activation_function(hidden_inputs)
final_inputs = np.dot(self.weights_hidden_to_output, hidden_outputs)
final_outputs = self.activation_function(final_inputs)
return final_outputs
def get_accuracy(self, XTrain, y1_train, y1_train_s):
count_total = 0
y_1_prediction = []
for x in XTrain:
inputs = np.asfarray(x)
predicted_value = self.query(inputs)
y_1_prediction.append(self.scaler_points.inverse_transform(predicted_value.reshape(1, -1))[0][0])
count_total += 1
print(self.hidden_nodes, self.e, count_total,
metrics.explained_variance_score(y_true = y1_train, y_pred = y_1_prediction), sep='\t')
return (self.hidden_nodes, self.e, count_total)
train_size=.2
pca_i=100
output_data = return_model_data_points(train_size=train_size, pca_i=pca_i, input_df=df2, keep_vars=['price_per_liter_clip'], save_tools = True)
XTrain, XTest, y1_train, y1_test,y1_train_s, y1_test_s, vectorizer, normalizer1, svd1, scaler,scaler_points = output_data
y1_n= 1
# number of input, hidden and output nodes
input_nodes = XTrain.shape[1]
hidden_nodes = 100
output_nodes = 1
# learning rate
learning_rate = 0.001
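# The weight draws below follow the common N(0, 1/sqrt(fan-in)) heuristic,
# which keeps the sigmoid units in their near-linear range at the start.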
weights_input_to_hidden = np.random.normal(0.0, pow(input_nodes, -0.5),
(hidden_nodes, input_nodes))
weights_hidden_to_output = np.random.normal(0.0, pow(hidden_nodes, -0.5),
(output_nodes, hidden_nodes))
n = neuralNetwork(input_nodes, hidden_nodes, output_nodes, learning_rate,
weights_input_to_hidden=weights_input_to_hidden,
weights_hidden_to_output=weights_hidden_to_output,scaler_points=scaler_points)
epochs = 50
for e in range(epochs):
XTrain, y1_train, y1_train_s = shuffle(XTrain, y1_train,y1_train_s)
n.train_df(XTrain, y1_train_s)
n.get_accuracy(XTrain, y1_train, y1_train_s)
n.get_accuracy(XTest, y1_test, y1_test_s)
pass
%%javascript
// Remove any existing chart div; re-running this cell would otherwise append duplicates.
$("#containernnp").remove();
// Make the div that will contain the chart.
element.append('<div id="containernnp" style="min-width: 310px; height: 400px; margin: 0 auto"></div>');
// Require Highcharts and draw the chart.
require(['highcharts_exports'], function(Highcharts) {
$('#containernnp').highcharts({
title: {
text: 'Custom Neural Network Points'
},
plotOptions: {
scatter: {
dataLabels: {
format: "{point.name}",
enabled: true
},
enableMouseTracking: false
}
},
yAxis: {
title: {
text: 'test'
}
},xAxis: {
title: {
text: 'train'
}
},
legend: {
enabled: false
},
series: [{name:'epoch 1',data:[[0.476865355386,0.477218923812]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 2',data:[[0.589641039695,0.588546596474]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 3',data:[[0.617138503993,0.615377340686]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 4',data:[[0.62578331654,0.623662834873]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 5',data:[[0.62889553909,0.626558528971]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 6',data:[[0.63042431661,0.627900952914]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 7',data:[[0.631436233198,0.628763146679]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 8',data:[[0.632202210426,0.629364831414]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 9',data:[[0.632301557825,0.629294840102]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 10',data:[[0.632978913559,0.629877691744]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 11',data:[[0.633290202922,0.630081236283]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 12',data:[[0.633522091636,0.630190907484]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 13',data:[[0.633890353599,0.630476349082]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 14',data:[[0.634001673426,0.630506819694]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 15',data:[[0.634259329029,0.630678026042]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 16',data:[[0.634387739692,0.630740908153]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 17',data:[[0.634742820067,0.631030774431]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 18',data:[[0.634700836538,0.630925040418]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 19',data:[[0.635132375443,0.631310940672]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 20',data:[[0.635351636091,0.631412283026]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 21',data:[[0.635435090832,0.631496069688]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 22',data:[[0.635456373286,0.631379350651]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 23',data:[[0.635708390833,0.631586816922]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 24',data:[[0.63603526151,0.631856187121]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 25',data:[[0.636340811359,0.632088499988]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 26',data:[[0.636227860658,0.63194530438]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 27',data:[[0.636683093283,0.632286977968]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 28',data:[[0.63672849179,0.632285703957]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 29',data:[[0.636807487555,0.632322461282]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 30',data:[[0.636876685682,0.632344876871]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 31',data:[[0.637056258573,0.632448006455]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 32',data:[[0.637548963569,0.632840976959]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 33',data:[[0.637481745302,0.632699481775]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 34',data:[[0.637481205413,0.632657981578]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 35',data:[[0.637640253374,0.632764898196]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 36',data:[[0.637976723973,0.633070154704]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 37',data:[[0.638261978346,0.633258811183]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 38',data:[[0.638203933835,0.63316757594]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 39',data:[[0.638381842658,0.633226762028]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 40',data:[[0.638929517814,0.633721312942]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 41',data:[[0.638581784362,0.633333598648]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 42',data:[[0.638702204205,0.633416021884]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 43',data:[[0.639263863602,0.63388720947]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 44',data:[[0.639259833497,0.633873927296]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 45',data:[[0.639383928788,0.63388708343]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 46',data:[[0.639388729534,0.633856365033]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 47',data:[[0.639521024905,0.633950948046]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 48',data:[[0.640012491852,0.634337398002]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 49',data:[[0.639964323043,0.634215570946]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}},
{name:'epoch 50',data:[[0.640104692876,0.634340015271]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'circle'}}]
});
});
Predict the country, category, and taster from the points, price, and review text. The following methods were tried:
MLP, Bagging, AdaBoost, DecisionTree, RandomForest, ExtraTrees, GradientBoosting, C-Support Vector Classification, KNeighbors, and VotingEnsemble.
Then run a neural network backwards to discover which terms matter for each country, category, and taster.
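As a minimal sketch of that back-query idea (assuming one hidden layer with sigmoid activations and toy weight matrices; the full implementation is the neuralNetwork class further below): invert the output sigmoid with the logit, push the result back through the transposed weights, rescale into (0.01, 0.99) so the logit stays defined, and repeat for the hidden layer.
import numpy as np
from scipy.special import logit
rng = np.random.default_rng(0)
W_ih = rng.normal(0.0, 0.5, (3, 4))  # hypothetical input-to-hidden weights
W_ho = rng.normal(0.0, 0.5, (2, 3))  # hypothetical hidden-to-output weights
def back_query_sketch(target):
    final_inputs = logit(target)  # undo the output sigmoid
    hidden_outputs = W_ho.T @ final_inputs  # back through the output weights
    hidden_outputs -= hidden_outputs.min()  # rescale into (0.01, 0.99)
    hidden_outputs /= hidden_outputs.max()
    hidden_outputs = hidden_outputs * 0.98 + 0.01
    return W_ih.T @ logit(hidden_outputs)  # back through the input weights
print(back_query_sketch(np.array([0.99, 0.01])))  # "ideal" input for output node 0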
def return_model_data(train_size, pca_i, input_df, keep_vars, save_tools = False):
lookupTable1, indexed_1 = np.unique(input_df[['taster']], return_inverse=True)
lookupTable2, indexed_2 = np.unique(input_df[['Category']], return_inverse=True)
lookupTable3, indexed_3 = np.unique(input_df[['l5']], return_inverse=True)
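# np.unique(..., return_inverse=True) yields both the label vocabulary
# (lookupTable*) and the integer-encoded targets (indexed_*) in one pass.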
train_test_split_output = train_test_split(input_df, indexed_1,indexed_2,indexed_3, random_state=1, test_size=1-train_size)
df_train, df_test, y1_train, y1_test, y2_train, y2_test, y3_train, y3_test = train_test_split_output
vectorizer = TfidfVectorizer()
#vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'))
vectorizer.fit(df_train.description.tolist())
if save_tools:
joblib.dump(vectorizer, 'vectorizer.pkl')
Tfidf_df_train = vectorizer.transform(df_train.description.tolist())
Tfidf_df_test = vectorizer.transform(df_test.description.tolist())
input_df_train = df_train.reset_index(drop=True).copy(deep=True)
input_df_test = df_test.reset_index(drop=True).copy(deep=True)
my_normalizer1 = Normalizer()
my_normalizer1.fit(Tfidf_df_train)
if save_tools:
joblib.dump(my_normalizer1, 'normalizer1.pkl')
Tfidf_df_train = my_normalizer1.transform(Tfidf_df_train)
Tfidf_df_test = my_normalizer1.transform(Tfidf_df_test)
svd1 = TruncatedSVD(n_components=pca_i, n_iter=7, random_state=42)
svd1.fit(Tfidf_df_train)
if save_tools:
joblib.dump(svd1, 'svd1.pkl')
text_df_train = pd.DataFrame(svd1.transform(Tfidf_df_train))
text_df_train = text_df_train.reset_index(drop=True)
text_df_test = pd.DataFrame(svd1.transform(Tfidf_df_test))
text_df_test = text_df_test.reset_index(drop=True)
final_input_train = pd.concat([input_df_train[keep_vars],text_df_train], axis=1)
final_input_test = pd.concat([input_df_test[keep_vars],text_df_test], axis=1)
scaler = StandardScaler()
scaler.fit(final_input_train)
if save_tools:
joblib.dump(scaler, 'scaler.pkl')
XTrain = scaler.transform(final_input_train)
XTest = scaler.transform(final_input_test)
return(XTrain, XTest,
y1_train, y1_test,
y2_train, y2_test,
y3_train, y3_test,
lookupTable1,lookupTable2,lookupTable3,
vectorizer,my_normalizer1,svd1,scaler)
from IPython.core.display import HTML
train_size=.01
pca_i=50
output_data = return_model_data(train_size=train_size, pca_i=pca_i, input_df=df2, keep_vars=['points','price_per_liter'])
XTrain, XTest, y1_train, y1_test, y2_train, y2_test, y3_train, y3_test,lookupTable1,lookupTable2,lookupTable3,vectorizer,normalizer1,svd1,scaler = output_data
mlpc = MLPClassifier(solver='lbfgs', alpha=1e-5, random_state=1)
mlpc.fit(XTrain, y1_train)
y_pred_train = mlpc.predict(XTrain)
y_pred_test = mlpc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y1_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y1_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y1_test, y_pred=y_pred_test)).to_html()))
mlpc = MLPClassifier(solver='lbfgs', alpha=1e-5, random_state=1)
mlpc.fit(XTrain, y2_train)
y_pred_train = mlpc.predict(XTrain)
y_pred_test = mlpc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y2_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y2_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y2_test, y_pred=y_pred_test)).to_html()))
mlpc = MLPClassifier(solver='lbfgs', alpha=1e-5, random_state=1)
mlpc.fit(XTrain, y3_train)
y_pred_train = mlpc.predict(XTrain)
y_pred_test = mlpc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y3_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y3_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y3_test, y_pred=y_pred_test)).to_html()))
bgc = BaggingClassifier(random_state=41)
bgc.fit(XTrain, y1_train)
y_pred_train = bgc.predict(XTrain)
y_pred_test = bgc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y1_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y1_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y1_test, y_pred=y_pred_test)).to_html()))
bgc = BaggingClassifier(random_state=41)
bgc.fit(XTrain, y2_train)
y_pred_train = bgc.predict(XTrain)
y_pred_test = bgc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y2_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y2_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y2_test, y_pred=y_pred_test)).to_html()))
bgc = BaggingClassifier(random_state=41)
bgc.fit(XTrain, y3_train)
y_pred_train = bgc.predict(XTrain)
y_pred_test = bgc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y3_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y3_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y3_test, y_pred=y_pred_test)).to_html()))
abc = AdaBoostClassifier(random_state=41)
abc.fit(XTrain, y1_train)
y_pred_train = abc.predict(XTrain)
y_pred_test = abc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y1_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y1_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y1_test, y_pred=y_pred_test)).to_html()))
abc = AdaBoostClassifier(random_state=41)
abc.fit(XTrain, y2_train)
y_pred_train = abc.predict(XTrain)
y_pred_test = abc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y2_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y2_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y2_test, y_pred=y_pred_test)).to_html()))
abc = AdaBoostClassifier(random_state=41)
abc.fit(XTrain, y3_train)
y_pred_train = abc.predict(XTrain)
y_pred_test = abc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y3_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y3_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y3_test, y_pred=y_pred_test)).to_html()))
for max_depth in range(5,20):
decisionTree = DecisionTreeClassifier(max_depth=max_depth)
decisionTree.fit(XTrain, y1_train)
y_pred_train = decisionTree.predict(XTrain)
y_pred_test = decisionTree.predict(XTest)
print(max_depth,"\t",train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y1_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y1_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y1_test, y_pred=y_pred_test)).to_html()))
for max_depth in range(5,20):
decisionTree = DecisionTreeClassifier(max_depth=max_depth)
decisionTree.fit(XTrain, y2_train)
y_pred_train = decisionTree.predict(XTrain)
y_pred_test = decisionTree.predict(XTest)
print(max_depth,"\t",train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y2_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y2_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y2_test, y_pred=y_pred_test)).to_html()))
for max_depth in range(5,20):
decisionTree = DecisionTreeClassifier(max_depth=max_depth)
decisionTree.fit(XTrain, y3_train)
y_pred_train = decisionTree.predict(XTrain)
y_pred_test = decisionTree.predict(XTest)
print(max_depth,"\t",train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y3_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y3_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y3_test, y_pred=y_pred_test)).to_html()))
rfc = RandomForestClassifier( random_state=41,n_jobs=-1)
rfc.fit(XTrain, y1_train)
y_pred_train = rfc.predict(XTrain)
y_pred_test = rfc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y1_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y1_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y1_test, y_pred=y_pred_test)).to_html()))
rfc = RandomForestClassifier( random_state=41,n_jobs=-1)
rfc.fit(XTrain, y2_train)
y_pred_train = rfc.predict(XTrain)
y_pred_test = rfc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y2_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y2_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y2_test, y_pred=y_pred_test)).to_html()))
rfc = RandomForestClassifier( random_state=41,n_jobs=-1)
rfc.fit(XTrain, y3_train)
y_pred_train = rfc.predict(XTrain)
y_pred_test = rfc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y3_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y3_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y3_test, y_pred=y_pred_test)).to_html()))
etc = ExtraTreesClassifier( random_state=41,n_jobs=-1)
etc.fit(XTrain, y1_train)
y_pred_train = etc.predict(XTrain)
y_pred_test = etc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y1_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y1_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y1_test, y_pred=y_pred_test)).to_html()))
etc = ExtraTreesClassifier( random_state=41,n_jobs=-1)
etc.fit(XTrain, y2_train)
y_pred_train = etc.predict(XTrain)
y_pred_test = etc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y2_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y2_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y2_test, y_pred=y_pred_test)).to_html()))
etc = ExtraTreesClassifier( random_state=41,n_jobs=-1)
etc.fit(XTrain, y3_train)
y_pred_train = etc.predict(XTrain)
y_pred_test = etc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y3_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y3_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y3_test, y_pred=y_pred_test)).to_html()))
gbc = GradientBoostingClassifier()
gbc.fit(XTrain, y1_train)
y_pred_train = gbc.predict(XTrain)
y_pred_test = gbc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y1_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y1_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y1_test, y_pred=y_pred_test)).to_html()))
gbc = GradientBoostingClassifier()
gbc.fit(XTrain, y2_train)
y_pred_train = gbc.predict(XTrain)
y_pred_test = gbc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y2_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y2_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y2_test, y_pred=y_pred_test)).to_html()))
gbc = GradientBoostingClassifier()
gbc.fit(XTrain, y3_train)
y_pred_train = gbc.predict(XTrain)
y_pred_test = gbc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y3_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y3_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y3_test, y_pred=y_pred_test)).to_html()))
svc = SVC(kernel='rbf',C=3)
svc.fit(XTrain, y1_train)
y_pred_train = svc.predict(XTrain)
y_pred_test = svc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y1_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y1_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y1_test, y_pred=y_pred_test)).to_html()))
svc = SVC(kernel='rbf',C=3)
svc.fit(XTrain, y2_train)
y_pred_train = svc.predict(XTrain)
y_pred_test = svc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y2_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y2_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y2_test, y_pred=y_pred_test)).to_html()))
svc = SVC(kernel='rbf',C=3)
svc.fit(XTrain, y3_train)
y_pred_train = svc.predict(XTrain)
y_pred_test = svc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y3_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y3_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y3_test, y_pred=y_pred_test)).to_html()))
for n_neighbors in range(2,8):
knc = KNeighborsClassifier(n_neighbors=n_neighbors)
knc.fit(XTrain, y1_train)
y_pred_train = knc.predict(XTrain)
y_pred_test = knc.predict(XTest)
print(n_neighbors,"\t",train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y1_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y1_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y1_test, y_pred=y_pred_test)).to_html()))
for n_neighbors in range(2,8):
knc = KNeighborsClassifier(n_neighbors=n_neighbors)
knc.fit(XTrain, y2_train)
y_pred_train = knc.predict(XTrain)
y_pred_test = knc.predict(XTest)
print(n_neighbors,"\t",train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y2_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y2_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y2_test, y_pred=y_pred_test)).to_html()))
for n_neighbors in range(2,8):
knc = KNeighborsClassifier(n_neighbors=n_neighbors)
knc.fit(XTrain, y3_train)
y_pred_train = knc.predict(XTrain)
y_pred_test = knc.predict(XTest)
print(n_neighbors,"\t",train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y3_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y3_test, y_pred = y_pred_test))
display(HTML(pd.DataFrame(metrics.confusion_matrix(y_true=y3_test, y_pred=y_pred_test)).to_html()))
%%javascript
// Remove any existing chart div; re-running this cell would otherwise append duplicates.
$("#container3").remove();
// Make the div that will contain the chart.
element.append('<div id="container3" style="min-width: 310px; height: 400px; margin: 0 auto"></div>');
// Require Highcharts and draw the chart.
require(['highcharts_exports'], function(Highcharts) {
$('#container3').highcharts({
title: {
text: 'Classification'
},
plotOptions: {
scatter: {
dataLabels: {
format: "{point.name}",
enabled: true
},
enableMouseTracking: false
}
},
yAxis: {
title: {
text: 'test'
}
},xAxis: {
title: {
text: 'train'
}
},
legend: {
enabled: false
},
series: [{name:'Taster - MLPClassifier',data:[[1,0.69877332204]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'Category - MLPClassifier',data:[[1,0.858154833804]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'Country - MLPClassifier',data:[[1,0.582068076055]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'Taster - BaggingClassifier',data:[[0.993326978074,0.569632862842]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'Category - BaggingClassifier',data:[[0.988083889418,0.850993998951]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'Country - BaggingClassifier',data:[[0.990943755958,0.597799775743]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'Taster - AdaBoostClassifier',data:[[0.370829361296,0.379793741007]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'Category - AdaBoostClassifier',data:[[0.780266920877,0.790694764602]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'Country - AdaBoostClassifier',data:[[0.491420400381,0.500733889325]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'Taster - DecisionTree',data:[[0.60962821735,0.512894796364]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'Category - DecisionTree',data:[[0.902764537655,0.833905205561]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'Country - DecisionTree',data:[[0.60819828408,0.56407936592]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'Taster - RandomForestClassifier',data:[[0.992850333651,0.551648965096]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'Category - RandomForestClassifier',data:[[0.988560533842,0.832899416257]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'Country - RandomForestClassifier',data:[[0.994280266921,0.580686920408]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'Taster - ExtraTreesClassifier',data:[[1,0.51779380838]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'Category - ExtraTreesClassifier',data:[[1,0.814010789376]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'Country - ExtraTreesClassifier',data:[[1,0.561268930735]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'Taster - GradientBoostingClassifier',data:[[0.999523355577,0.636462509083]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'Category - GradientBoostingClassifier',data:[[0.99714013346,0.870224305452]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'Country - GradientBoostingClassifier',data:[[0.995710200191,0.624893525893]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'Taster - C-Support Vector Classification',data:[[0.989037178265,0.729202057778]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'Category - C-Support Vector Classification',data:[[0.991420400381,0.886119626366]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'Country - C-Support Vector Classification',data:[[0.981410867493,0.660861321386]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}},
{name:'Taster - KNeighbors',data:[[0.999523355577,0.636597255976]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'square'}},
{name:'Category - KNeighbors',data:[[0.99714013346,0.870397551456]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'diamond'}},
{name:'Country - KNeighbors',data:[[0.996663489037,0.62598112581]],marker:{fillColor:'rgba(255,0,0,.5)',symbol:'triangle'}}]
});
});
train_size=.1
pca_i=50
output_data = return_model_data(train_size=train_size, pca_i=pca_i, input_df=df2, keep_vars=['points','price_per_liter'])
XTrain, XTest, y1_train, y1_test, y2_train, y2_test, y3_train, y3_test,lookupTable1,lookupTable2,lookupTable3,vectorizer,normalizer1,svd1,scaler = output_data
mlpc = MLPClassifier(solver='lbfgs', alpha=1e-5, random_state=41)
bgc = BaggingClassifier(random_state=41)
abc = AdaBoostClassifier(random_state=41)
dtc = DecisionTreeClassifier(max_depth=13,random_state=41)
rfc = RandomForestClassifier( random_state=41)
etc = ExtraTreesClassifier( random_state=41)
gbc = GradientBoostingClassifier(random_state=41)
svc = SVC(kernel='rbf',C=3)
knc = KNeighborsClassifier(n_neighbors=5)
eclf = VotingClassifier(estimators=[('mlpc', mlpc),
('bgc', bgc),
('abc', abc),
('dtc', dtc),
('rfc', rfc),
('etc', etc),
('gbc', gbc),
('svc', svc),
('knc', knc)], voting='hard')
list_of_models = [mlpc,bgc,abc,dtc,rfc,etc,gbc,svc,knc,eclf ]
list_of_models_names = ['MLP', 'Bagging', 'AdaBoost', 'DecisionTree',
'RandomForest', 'ExtraTrees', 'GradientBoosting', 'C-Support Vector Classification',
'KNeighbors','VotingEnsemble']
for clf, label in zip(list_of_models, list_of_models_names):
scores = cross_val_score(clf, XTrain, y1_train, cv=3, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f) [%s] %s" % (scores.mean(), scores.std(), label, "y1"))
#scores = cross_val_score(clf, XTrain, y2_train, cv=3, scoring='accuracy')
#print("Accuracy: %0.2f (+/- %0.2f) [%s] %s" % (scores.mean(), scores.std(), label, "y2"))
#scores = cross_val_score(clf, XTrain, y3_train, cv=3, scoring='accuracy')
#print("Accuracy: %0.2f (+/- %0.2f) [%s] %s" % (scores.mean(), scores.std(), label, "y3"))
mlpc = MLPClassifier(solver='lbfgs', alpha=1e-5, random_state=41)
gbc = GradientBoostingClassifier(random_state=41)
svc = SVC(kernel='rbf',C=3)
eclf = VotingClassifier(estimators=[('mlpc', mlpc),
('gbc', gbc),
('svc', svc)], voting='hard')
scores = cross_val_score(eclf, XTrain, y1_train, cv=3, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f) [%s] %s" % (scores.mean(), scores.std(), label, "y1"))
svc = SVC(kernel='rbf',C=3)
svc.fit(XTrain, y1_train)
y_pred_train = svc.predict(XTrain)
y_pred_test = svc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y1_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y1_test, y_pred = y_pred_test))
svc = SVC(kernel='rbf',C=3)
svc.fit(XTrain, y2_train)
y_pred_train = svc.predict(XTrain)
y_pred_test = svc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y2_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y2_test, y_pred = y_pred_test))
svc = SVC(kernel='rbf',C=3)
svc.fit(XTrain, y3_train)
y_pred_train = svc.predict(XTrain)
y_pred_test = svc.predict(XTest)
print(train_size,"\t",pca_i, "\t",
metrics.accuracy_score(y_true = y3_train, y_pred = y_pred_train),"\t",
metrics.accuracy_score(y_true = y3_test, y_pred = y_pred_test))
class neuralNetwork:
def __init__(self, input_nodes, hidden_nodes, output_nodes, learning_rate, lookupTable1, lookupTable2,
lookupTable3,weights_input_to_hidden,weights_hidden_to_output):
self.input_nodes = input_nodes
self.hidden_nodes = hidden_nodes
self.output_nodes = output_nodes
self.weights_input_to_hidden = weights_input_to_hidden
self.weights_hidden_to_output = weights_hidden_to_output
self.learning_rate = learning_rate
self.lookupTable1 = lookupTable1
self.lookupTable2 = lookupTable2
self.lookupTable3 = lookupTable3
self.y1_n = len(lookupTable1)
self.y2_n = len(lookupTable2)
self.y3_n = len(lookupTable3)
self.e = 0
pass
def pickle(self):
joblib.dump(self.weights_input_to_hidden, 'weights_input_to_hidden'+str(self.e)+'.pkl')
joblib.dump(self.weights_hidden_to_output, 'weights_hidden_to_output'+str(self.e)+'.pkl')
joblib.dump(self.lookupTable1, 'lookupTable1'+str(self.e)+'.pkl')
joblib.dump(self.lookupTable2, 'lookupTable2'+str(self.e)+'.pkl')
joblib.dump(self.lookupTable3, 'lookupTable3'+str(self.e)+'.pkl')
joblib.dump(self.input_nodes, 'input_nodes'+str(self.e)+'.pkl')
joblib.dump(self.hidden_nodes, 'hidden_nodes'+str(self.e)+'.pkl')
joblib.dump(self.output_nodes, 'output_nodes'+str(self.e)+'.pkl')
pass
def activation_function(self,x):
return sc.special.expit(x)
def inverse_activation_function(self,x):
return sc.special.logit(x)
def get_e(self):
return self.e
def get_lookup_Tables(self):
return (self.lookupTable1, self.lookupTable2, self.lookupTable3)
def train(self, inputs_list, targets_list):
inputs = np.array(inputs_list, ndmin=2).T
targets = np.array(targets_list, ndmin=2).T
hidden_inputs = np.dot(self.weights_input_to_hidden, inputs)
hidden_outputs = self.activation_function(hidden_inputs)
final_inputs = np.dot(self.weights_hidden_to_output, hidden_outputs)
final_outputs = self.activation_function(final_inputs)
output_errors = targets - final_outputs
hidden_errors = np.dot(self.weights_hidden_to_output.T, output_errors)
self.weights_hidden_to_output += self.learning_rate * np.dot(
(output_errors * final_outputs * (1.0 - final_outputs)), np.transpose(hidden_outputs))
self.weights_input_to_hidden += self.learning_rate * np.dot(
(hidden_errors * hidden_outputs * (1.0 - hidden_outputs)), np.transpose(inputs))
self.e += 1
pass
def train_df(self, XTrain, y1_train, y2_train, y3_train):
for x, y1, y2, y3 in zip(XTrain, y1_train, y2_train, y3_train):
inputs = np.asfarray(x)
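# Targets are three concatenated one-hot blocks (taster, category, country),
# encoded as 0.01/0.99 rather than 0/1 so the sigmoid and its logit inverse
# stay well-defined.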
targets = np.zeros(self.output_nodes) + 0.01
targets[y1] = .99
targets[self.y1_n + y2] = .99
targets[self.y1_n + self.y2_n + y3] = .99
self.train(inputs, targets)
pass
pass
def query(self, inputs_list):
inputs = np.array(inputs_list, ndmin=2).T
hidden_inputs = np.dot(self.weights_input_to_hidden, inputs)
hidden_outputs = self.activation_function(hidden_inputs)
final_inputs = np.dot(self.weights_hidden_to_output, hidden_outputs)
final_outputs = self.activation_function(final_inputs)
return final_outputs
def back_query(self, targets_list):
final_outputs = np.array(targets_list, ndmin=2).T
final_inputs = self.inverse_activation_function(final_outputs)
hidden_outputs = np.dot(self.weights_hidden_to_output.T, final_inputs)
hidden_outputs -= np.min(hidden_outputs)
hidden_outputs /= np.max(hidden_outputs)
hidden_outputs *= 0.98
hidden_outputs += 0.01
hidden_inputs = self.inverse_activation_function(hidden_outputs)
inputs = np.dot(self.weights_input_to_hidden.T, hidden_inputs)
return inputs
def get_target(self, country, category, taster):
targets = np.zeros(self.output_nodes) + 0.01
targets[self.lookupTable1.tolist().index(taster)] = .99
targets[self.y1_n + self.lookupTable2.tolist().index(category)] = .99
targets[self.y1_n + self.y2_n + self.lookupTable3.tolist().index(country)] = .99
return targets
def create_word_cloud(self, plot_cloud, cutoff, abs_pass, country, category, taster, vectorizer, scaler, svd, map_mask_path):
my_input = self.back_query(self.get_target(country, category, taster))
my_input2 = scaler.inverse_transform([x[0] for x in my_input])
my_input3 = svd.inverse_transform(my_input2[2:].reshape(1, -1))[0]
my_features = vectorizer.get_feature_names()
if abs_pass:
high_indexes = np.where(np.abs(my_input3)>cutoff)
else:
high_indexes = np.where(my_input3>cutoff)
my_dic ={}
for x in high_indexes[0]:
my_dic[my_features[x]]= math.floor(my_input3[x]*1000)
if plot_cloud:
map_mask = np.array(Image.open(map_mask_path))
wc = WordCloud(background_color="white", max_words=2000, mask=map_mask)
wc.generate_from_frequencies(my_dic)
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
return my_dic
def get_accuracy(self, XTrain, y1_train, y2_train, y3_train):
count_right1 = 0
count_right2 = 0
count_right3 = 0
count_total = 0
for x, y1, y2, y3 in zip(XTrain, y1_train, y2_train, y3_train):
inputs = np.asfarray(x)
predicted_value = self.query(inputs)
y_1_prediction = np.argmax([predicted_value[i][0] for i in range(0, self.y1_n)])
y_2_prediction = np.argmax([predicted_value[i][0] for i in range(self.y1_n, self.y1_n + self.y2_n)])
y_3_prediction = np.argmax(
[predicted_value[i][0] for i in range(self.y1_n + self.y2_n, self.y1_n + self.y2_n + self.y3_n)])
count_total += 1
if y1 == y_1_prediction:
count_right1 += 1
if y2 == y_2_prediction:
count_right2 += 1
if y3 == y_3_prediction:
count_right3 += 1
print(self.hidden_nodes, self.e, count_total, count_right1 / count_total, count_right2 / count_total,
count_right3 / count_total, count_total, sep='\t')
return (self.hidden_nodes, self.e, count_total, count_right1 / count_total, count_right2 / count_total,
count_right3 / count_total, count_total)
def labeled_query(self, x):
inputs = np.asfarray(x)
predicted_value = self.query(inputs)
y_1_prediction = np.argmax([predicted_value[i][0] for i in range(0, self.y1_n)])
y_2_prediction = np.argmax([predicted_value[i][0] for i in range(self.y1_n, self.y1_n + self.y2_n)])
y_3_prediction = np.argmax( [predicted_value[i][0] for i in range(self.y1_n + self.y2_n, self.y1_n + self.y2_n + self.y3_n)])
return(self.lookupTable1[y_1_prediction],
self.lookupTable2[y_2_prediction],
self.lookupTable3[y_3_prediction])
train_size=.5
pca_i=100
output_data = return_model_data(train_size=train_size, pca_i=pca_i, input_df=df2, keep_vars=['points','price_per_liter'], save_tools = True)
XTrain, XTest, y1_train, y1_test, y2_train, y2_test, y3_train, y3_test, lookupTable1, lookupTable2, lookupTable3, vectorizer, normalizer1, svd1, scaler = output_data
y1_n= len(lookupTable1)
y2_n= len(lookupTable2)
y3_n= len(lookupTable3)
# number of input, hidden and output nodes
input_nodes = XTrain.shape[1]
hidden_nodes = 100
output_nodes = y1_n + y2_n + y3_n
# learning rate
learning_rate = 0.001
weights_input_to_hidden = np.random.normal(0.0, pow(input_nodes, -0.5),
(hidden_nodes, input_nodes))
weights_hidden_to_output = np.random.normal(0.0, pow(hidden_nodes, -0.5),
(output_nodes, hidden_nodes))
n = neuralNetwork(input_nodes, hidden_nodes, output_nodes, learning_rate,
lookupTable1, lookupTable2, lookupTable3,
weights_input_to_hidden=weights_input_to_hidden,
weights_hidden_to_output=weights_hidden_to_output)
n.pickle()
epochs = 1000
for e in range(epochs):
XTrain, y1_train, y2_train, y3_train = shuffle(XTrain, y1_train,y2_train,y3_train)
n.train_df(XTrain,y1_train, y2_train, y3_train)
n.get_accuracy(XTrain, y1_train, y2_train, y3_train)
n.get_accuracy(XTest, y1_test, y2_test, y3_test)
n.pickle()
pass
cor_array = []
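# For each training row, back-query the "ideal" input for its true labels and
# compare it to the actual input; Spearman rank correlation measures agreement.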
for x,y1,y2,y3 in zip(XTrain,y1_train, y2_train, y3_train):
inputs = np.asfarray(x)
targets = np.zeros(output_nodes) + 0.01
targets[y1] = .99
targets[y1_n+y2] = .99
targets[y1_n+y2_n+y3] = .99
cor_array.append(sc.stats.spearmanr(np.asfarray([x[0] for x in n.back_query(targets)]),inputs).correlation)
pass
import matplotlib.pyplot as plt
plt.hist(cor_array, bins='auto') # arguments are passed to np.histogram
plt.title("Histogram with 'auto' bins")
plt.show()
n.create_word_cloud(plot_cloud=True, cutoff=0.01, abs_pass=False,
country='US', category='Red', taster='Paul Gregutt',
vectorizer=vectorizer, scaler=scaler, svd=svd1,
map_mask_path="wine2_removed.png")
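The same trained network can also be queried forwards to label a single scaled row; a quick sanity check (row 0 of XTest is an arbitrary choice):
print(n.labeled_query(XTest[0]))  # -> (taster, category, country)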
Make sure that you plan your work so that you can avoid a big rush right before the final project deadline, and delegate different modules and responsibilities among your team members. Write this in terms of weekly deadlines.
| Date | Item | Assigned | Details |
|---|---|---|---|
| Finish March 27, 2018 | Acquire Data | Brian Lead | Pull the data from https://www.winemag.com using Python. |
| Finish March 29, 2018 | Clean Data | Li Lead | Start on the preliminary data. |
| Finish March 29, 2018 | Explore Data | Trevor Lead | Start on the preliminary data. |
| Start March 29, 2018 | Start Analysis | Team | Run preliminary models on the full data before the milestone. |
| Due Apr 1, 2018 | Project Milestone Due | Team | By the milestone we are expected to have acquired, cleaned, and explored the dataset, to explain in more detail what will go into the final analysis, and to explain deviations from the initial project plan. |
| Finish Apr 6, 2018 | Finish Analysis | Team | What are the key drivers of wine points and wine price? Penalised linear models (elastic net) and trees. Which wines are systematically over- or under-valued on price? Penalised linear models, random forests, gradient boosting machines, k-NNs, SVMs, and neural networks. Which attributes should be used to market each wine? PCA, FA, SEMs, and Bayesian networks. How similar are the different wine varieties? Clustering and profiling. |
| Finish Apr 6, 2018 | Document Findings | Team | Update the Jupyter Notebook. |
| Finish Apr 13, 2018 | Create Prediction Tool | Team | Create an HTML tool that outputs results and visualizations dynamically. |
| ? | Project Review with the Staff | Team | |
| Finish Apr 13, 2018 | Develop Screencast | Team | A three-minute screencast with narration showing a demo of the project and/or some slides, uploaded to a platform such as YouTube or Vimeo and linked from the notebook. The three-minute limit is strictly enforced and the sound quality must be good. Focus the majority of the screencast on the main contributions rather than technical details, with the single most important takeaway front and center rather than at the end. |
| Finish Apr 20, 2018 | Complete Peer Evaluations | Team | Provide an honest assessment of the contributions of every team member, including yourself: positive feedback for those who truly worked hard for the good of the team, and suggestions for those perceived as less effective on team tasks. Teammates' assessments of your contributions and the accuracy of your self-assessment will be considered as part of the overall project score. |
| Due Apr 22, 2018 | Final Project Due | Team | Complete the analysis in the notebook and present the results in a compelling way. |
| ? | Project Presentation | Team | Each team will be given a brief slot (~5 minutes) in one of the two last lectures to present its analysis questions and main contributions, explain its methods, and justify its choices. |
Group Name: Weather our power consumption will change?
Reviewers: Brian Tillman and Trevor Olsen
Group Member Names: Aaron Young
Objective: Identify how much power is being used depending on the weather forecast. Which buildings use the most? The model will not handle real-time data but will predict daily usage based on the weather.
Dataset: Weather data collected from different sources, plus the power consumption of University of Utah buildings.
Data Processing: Data must be collected in the middle of the month, because the pay system occupies the first week of the month. Cleaning the data will be time-intensive, so the project will depend on how quickly they can collect the data.
Exploratory Analysis: Use the client tool provided by the university for introductory exploration and to find the variables most affected by weather. Built-in visualization tools will help with the analysis.
Analysis Methods: Nonlinear autoregressive model with exogenous inputs (NARX; see the sketch below), plus data visualization of how weather factors affect power consumption. Focus on power usage, not cost.
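As a rough illustration of the NARX idea only (all data and parameters below are made up; the group's actual model may differ): regress today's consumption on its own recent lags plus an exogenous weather feature.
# Hypothetical NARX-style sketch: power regressed on its own lags plus weather.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
days = 200
temp = 20 + 10 * np.sin(np.arange(days) / 30)    # synthetic exogenous series
power = 100 + 2 * temp + rng.normal(0, 1, days)  # synthetic daily consumption

lags = 3  # autoregressive order
X = np.column_stack([power[i:days - lags + i] for i in range(lags)] + [temp[lags:]])
y = power[lags:]
model = RandomForestRegressor(random_state=0).fit(X[:-30], y[:-30])
print(model.score(X[-30:], y[-30:]))  # R^2 on the last 30 days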
Must-have Features: A good model that correlates consumption with weather data; letting the user pick which building to look at; overall energy consumption on campus (the sum over all equipment on campus).
Optional Features: Explore the cost model; heat-map the data onto the University of Utah campus map; add energy usage to the campus map; separate models for the bigger buildings.
Schedule: Data to be collected over spring break.
General Questions:
• Very interesting and unique dataset; the data collection is operational, with results to be presented to the university facilities board.
• The model will be daily rather than real-time: slices of data are pulled and fed to the model.
Data Acquisition and Clean Up:
• The data acquisition will need a fair amount of cleanup, because the data is large and hard to access; solving this would also help other building managers.
• Go to the browser-based API and then scrape for unique identifiers.
• There are hundreds of buildings, each with at least 4 sections, each containing 4 parts. There is no hospital data, and some other buildings are restricted.
• Weather data is gathered from multiple sources.
• How will the data be stored? Pull the data, then pull data for a training set, then use the data from one data to…
Analysis Methodology:
• Equipment clustering could be used to visualize how the meters correlate with each other and with the weather; at any given moment the pieces of equipment will sit at different distances from one another.
• A random-effects model within the time frame.
• Heat-mapping the data onto the University of Utah campus map could be more important for the project.
• The equipment will vary, and some meters will turn out not to be correlated with each other.